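This case explores and visualizes the main causes of employee attrition in the given HR data, in order to give management insights and possible solutions to the problem.
The recommended order of work is as follows.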
# Load the required libraries
# Libraries for data handling
import numpy as np
import pandas as pd
# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-v0_8') # matplotlib style; use 'seaborn' on matplotlib < 3.6
# Suppress warnings caused by deprecated syntax
import warnings
warnings.filterwarnings('ignore')
# Jupyter magic command that renders matplotlib figures inline in the notebook
%matplotlib inline
data = pd.read_csv('data/WA_Fn-UseC_-HR-Employee-Attrition.csv')
data.head(5)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 2 | 3 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 5 | 4 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 7 | 1 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
5 rows × 33 columns
# Create BeforeWorkingYears by subtracting YearsAtCompany from TotalWorkingYears
data.loc[:,'BeforeWorkingYears']=data.TotalWorkingYears-data.YearsAtCompany
data.drop(['TotalWorkingYears'], axis=1, inplace=True)
data.head(5)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 1 | 80 | 0 | 0 | 1 | 6 | 4 | 0 | 5 | 2 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 2 | 3 | ... | 4 | 80 | 1 | 3 | 3 | 10 | 7 | 1 | 7 | 0 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | 4 | ... | 2 | 80 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 7 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 5 | 4 | ... | 3 | 80 | 0 | 3 | 3 | 8 | 7 | 3 | 0 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 7 | 1 | ... | 4 | 80 | 1 | 3 | 3 | 2 | 2 | 2 | 2 | 4 |
5 rows × 33 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 33 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeNumber            1470 non-null   int64
 9   EnvironmentSatisfaction   1470 non-null   int64
 10  Gender                    1470 non-null   object
 11  HourlyRate                1470 non-null   int64
 12  JobInvolvement            1470 non-null   int64
 13  JobLevel                  1470 non-null   int64
 14  JobRole                   1470 non-null   object
 15  JobSatisfaction           1470 non-null   int64
 16  MaritalStatus             1470 non-null   object
 17  MonthlyIncome             1470 non-null   int64
 18  MonthlyRate               1470 non-null   int64
 19  NumCompaniesWorked        1470 non-null   int64
 20  OverTime                  1470 non-null   object
 21  PercentSalaryHike         1470 non-null   int64
 22  PerformanceRating         1470 non-null   int64
 23  RelationshipSatisfaction  1470 non-null   int64
 24  StandardHours             1470 non-null   int64
 25  StockOptionLevel          1470 non-null   int64
 26  TrainingTimesLastYear     1470 non-null   int64
 27  WorkLifeBalance           1470 non-null   int64
 28  YearsAtCompany            1470 non-null   int64
 29  YearsInCurrentRole        1470 non-null   int64
 30  YearsSinceLastPromotion   1470 non-null   int64
 31  YearsWithCurrManager      1470 non-null   int64
 32  BeforeWorkingYears        1470 non-null   int64
dtypes: int64(25), object(8)
memory usage: 379.1+ KB
# Encode Attrition: Yes=1, No=0
data.loc[data.Attrition=='Yes','Attrition']=1
data.loc[data.Attrition=='No','Attrition']=0
data['Attrition']=data['Attrition'].astype('float')
data.head(5)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1.0 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 1 | 80 | 0 | 0 | 1 | 6 | 4 | 0 | 5 | 2 |
| 1 | 49 | 0.0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 2 | 3 | ... | 4 | 80 | 1 | 3 | 3 | 10 | 7 | 1 | 7 | 0 |
| 2 | 37 | 1.0 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | 4 | ... | 2 | 80 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 7 |
| 3 | 33 | 0.0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 5 | 4 | ... | 3 | 80 | 0 | 3 | 3 | 8 | 7 | 3 | 0 | 0 |
| 4 | 27 | 0.0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 7 | 1 | ... | 4 | 80 | 1 | 3 | 3 | 2 | 2 | 2 | 2 | 4 |
5 rows × 33 columns
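The Yes/No encoding above can also be written as a single mapping step. The sketch below uses a small toy frame (an assumption for illustration, not the HR data itself) and shows that `Series.map` yields the same 1.0/0.0 values.

```python
import pandas as pd

# Toy frame standing in for the real HR data (values are illustrative only).
toy = pd.DataFrame({"Attrition": ["Yes", "No", "Yes", "No", "No"]})

# Map the labels to 1/0 in one step; labels missing from the mapping
# would become NaN, which makes unexpected categories easy to spot.
toy["Attrition"] = toy["Attrition"].map({"Yes": 1, "No": 0}).astype("float")

print(toy["Attrition"].tolist())
```

Unlike the two `.loc` assignments, this version is not idempotent: running it a second time on already-encoded values would produce NaN, so it fits best in a cell that is executed once.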
data.describe()
| Age | Attrition | DailyRate | DistanceFromHome | Education | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | ... | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 |
| mean | 36.923810 | 0.161224 | 802.485714 | 9.192517 | 2.912925 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 2.799320 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 | 4.271429 |
| std | 9.135373 | 0.367863 | 403.509100 | 8.106864 | 1.024165 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.106940 | ... | 1.081209 | 0.0 | 0.852077 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.222430 | 3.568136 | 6.179783 |
| min | 18.000000 | 0.000000 | 102.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 80.0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 30.000000 | 0.000000 | 465.000000 | 2.000000 | 2.000000 | 491.250000 | 2.000000 | 48.000000 | 2.000000 | 1.000000 | ... | 2.000000 | 80.0 | 0.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 | 0.000000 |
| 50% | 36.000000 | 0.000000 | 802.000000 | 7.000000 | 3.000000 | 1020.500000 | 3.000000 | 66.000000 | 3.000000 | 2.000000 | ... | 3.000000 | 80.0 | 1.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 | 2.000000 |
| 75% | 43.000000 | 0.000000 | 1157.000000 | 14.000000 | 4.000000 | 1555.750000 | 4.000000 | 83.750000 | 3.000000 | 3.000000 | ... | 4.000000 | 80.0 | 1.000000 | 3.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 | 5.000000 |
| max | 60.000000 | 1.000000 | 1499.000000 | 29.000000 | 5.000000 | 2068.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | ... | 4.000000 | 80.0 | 3.000000 | 6.000000 | 4.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 | 33.000000 |
8 rows × 26 columns
# Check for missing values and outliers (isnull() only covers missing values)
data.isnull().sum()
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeNumber              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StandardHours               0
StockOptionLevel            0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
BeforeWorkingYears          0
dtype: int64
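The problem can be addressed through variables that management can control.
Data that is known before hiring (for example, education level) may also help anticipate attrition.
The columns can be split into two groups: variables management can control and variables it cannot.
Controllable
Attrition, business travel frequency, daily rate, department, job level, job role
Not controllable
Education level, education field
Environment satisfaction, job satisfaction > too obvious
Managing the voluntary turnover of key talent is what matters most to the management group. Characteristics of key talent > compare with other groups. How can key talent be defined?
- It can probably be identified indirectly as employees with a high salary hike rate.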
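One way to make the "key talent" idea concrete is sketched below. The helper name `attrition_by_key_talent`, the 75th-percentile cutoff, and the toy values are all assumptions made for illustration; the notes above only say that a high salary hike rate could serve as an indirect marker.

```python
import pandas as pd

def attrition_by_key_talent(df, hike_col="PercentSalaryHike",
                            target_col="Attrition", quantile=0.75):
    """Compare attrition between a hypothetical 'key talent' group and the rest.

    Key talent is ASSUMED to be employees whose salary hike is at or above
    the given quantile; the threshold is illustrative, not from the source.
    """
    threshold = df[hike_col].quantile(quantile)
    is_key = (df[hike_col] >= threshold).rename("key_talent")
    return df.groupby(is_key)[target_col].agg(["count", "mean"])

# Toy data for illustration only (not taken from the HR dataset).
toy = pd.DataFrame({
    "PercentSalaryHike": [11, 12, 13, 14, 15, 18, 22, 25],
    "Attrition":         [1,  0,  0,  1,  0,  0,  1,  0],
})
summary = attrition_by_key_talent(toy)
print(summary)
```

On the real data the same call would be `attrition_by_key_talent(data)`, after which the two groups can be compared on the other columns.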
# Plot formatting settings
parameters = {
'axes.titlesize': 25,
'axes.labelsize': 20,
'ytick.labelsize': 20
}
plt.rcParams.update(parameters)
f, ax= plt.subplots(1,2,figsize=(20,10))
data['Attrition'].value_counts().plot.pie(autopct='%1.1f%%',
ax=ax[0],
fontsize=20)
sns.countplot(x="Attrition", data=data, ax=ax[1])
plt.show()
Of all employees, 16.1% (237 people) have left the company,
while 83.9% (1,233 people) are still employed.
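The counts and percentages quoted here can be read directly from `value_counts`. The sketch below rebuilds the target column from the reported counts instead of loading the CSV, so it is a self-contained check and not the notebook's own cell.

```python
import pandas as pd

# Rebuild the target column from the counts reported above
# (237 employees who left, 1233 who stayed).
attrition = pd.Series([1.0] * 237 + [0.0] * 1233, name="Attrition")

counts = attrition.value_counts()
shares = attrition.value_counts(normalize=True).mul(100).round(1)
print(pd.DataFrame({"count": counts, "percent": shares}))
```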
# Check the Age distribution
plt.figure(figsize=(20,10))
sns.countplot(data=data,x='Age')
plt.xticks(rotation=30)
plt.show()
# Create Age_band
data['Age_band'] = 0
data.loc[data['Age'] <= 25, 'Age_band'] = 0
data.loc[(data['Age'] > 25) & (data['Age'] <= 33), 'Age_band'] = 1
data.loc[(data['Age'] > 33) & (data['Age'] <= 41), 'Age_band'] = 2
data.loc[(data['Age'] > 41) & (data['Age'] <= 49), 'Age_band'] = 3
data.loc[data['Age'] > 49, 'Age_band'] = 4
data.head(5)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | StandardHours | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | Age_band | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1.0 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 80 | 0 | 0 | 1 | 6 | 4 | 0 | 5 | 2 | 2 |
| 1 | 49 | 0.0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 2 | 3 | ... | 80 | 1 | 3 | 3 | 10 | 7 | 1 | 7 | 0 | 3 |
| 2 | 37 | 1.0 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | 4 | ... | 80 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 7 | 2 |
| 3 | 33 | 0.0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 5 | 4 | ... | 80 | 0 | 3 | 3 | 8 | 7 | 3 | 0 | 0 | 1 |
| 4 | 27 | 0.0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 7 | 1 | ... | 80 | 1 | 3 | 3 | 2 | 2 | 2 | 2 | 4 | 1 |
5 rows × 34 columns
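The five `.loc` assignments that build `Age_band` follow a fixed set of right-closed intervals, so `pd.cut` can express the same rule in one call. This is an alternative sketch, checked here against the ages of the first five rows shown above.

```python
import pandas as pd

# Ages of the first five employees shown in the table above.
ages = pd.Series([41, 49, 37, 33, 27], name="Age")

# Right-closed bins reproduce the rules used above:
# <=25 -> 0, 26-33 -> 1, 34-41 -> 2, 42-49 -> 3, >49 -> 4.
bins = [-float("inf"), 25, 33, 41, 49, float("inf")]
age_band = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)

print(age_band.tolist())
```

Keeping the bin edges in one list makes it easier to try different band widths later.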
# Check the distribution by band
plt.figure(figsize=(20,10))
sns.countplot(data=data,x='Age_band')
plt.show()
# Attrition counts by Age_band
sns.countplot(x='Age_band', hue='Attrition', data=data)
<AxesSubplot:xlabel='Age_band', ylabel='count'>
pd.crosstab(data.Attrition,
data.Age_band, margins=True).style.background_gradient(
cmap='summer_r')
| Age_band | 0 | 1 | 2 | 3 | 4 | All |
|---|---|---|---|---|---|---|
| Attrition | ||||||
| 0.0 | 79 | 354 | 421 | 229 | 150 | 1233 |
| 1.0 | 44 | 97 | 50 | 23 | 23 | 237 |
| All | 123 | 451 | 471 | 252 | 173 | 1470 |
sns.lineplot(x='Age', y='Attrition', data=data)
<AxesSubplot:xlabel='Age', ylabel='Attrition'>
sns.catplot(x='Age_band', y='Attrition', kind='point', data=data)
plt.show()
print('Attrition rate for Age_band 0: ', data[data.Age_band==0].Attrition.mean())
print('Attrition rate for Age_band 1: ', data[data.Age_band==1].Attrition.mean())
print('Attrition rate for Age_band 2: ', data[data.Age_band==2].Attrition.mean())
print('Attrition rate for Age_band 3: ', data[data.Age_band==3].Attrition.mean())
print('Attrition rate for Age_band 4: ', data[data.Age_band==4].Attrition.mean())
Attrition rate for Age_band 0:  0.35772357723577236
Attrition rate for Age_band 1:  0.21507760532150777
Attrition rate for Age_band 2:  0.10615711252653928
Attrition rate for Age_band 3:  0.09126984126984126
Attrition rate for Age_band 4:  0.1329479768786127
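Because `Attrition` is coded 0/1, the per-band rates printed above are simply group means, and a single `groupby` returns all of them at once. The sketch below rebuilds a frame from the crosstab counts shown earlier so that it runs without the CSV.

```python
import pandas as pd

# Counts per Age_band taken from the crosstab above: (stayed, left).
band_counts = {0: (79, 44), 1: (354, 97), 2: (421, 50), 3: (229, 23), 4: (150, 23)}

rows = []
for band, (stayed, left) in band_counts.items():
    rows += [(band, 0.0)] * stayed + [(band, 1.0)] * left
toy = pd.DataFrame(rows, columns=["Age_band", "Attrition"])

# The mean of a 0/1 target per group is the attrition rate of that group.
rates = toy.groupby("Age_band")["Attrition"].mean()
print(rates.round(4))
```

On the notebook's own frame the equivalent call is `data.groupby('Age_band')['Attrition'].mean()`.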
# Education
plt.figure(figsize=(10,6))
sns.violinplot(x='Education', y='Age', hue='Attrition', data=data, split=True)
<AxesSubplot:xlabel='Education', ylabel='Age'>
# Education Field
plt.figure(figsize=(10,6))
sns.violinplot(x='EducationField', y='Age', hue='Attrition', data=data, split=True)
plt.xticks(rotation=50)
plt.show()
# Gender
plt.figure(figsize=(15,8))
sns.lineplot(x='Age', y='Attrition', hue='Gender', data=data)
plt.show()
# Marital Status
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='Age', y='Attrition', hue='MaritalStatus', data=data, ax=ax[0])
sns.violinplot(x='MaritalStatus', y='Age', hue='Attrition', data=data, split=True, ax=ax[1])
plt.show()
sns.countplot(x='Age_band', hue='MaritalStatus', data=data)
<AxesSubplot:xlabel='Age_band', ylabel='count'>
# NumCompaniesWorked
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='Age', y='NumCompaniesWorked', hue='Attrition', data=data, ax=ax[0])
plt.scatter(data.loc[data.Attrition==1,'Age'], data.loc[data.Attrition==1,'NumCompaniesWorked'], c = 'pink', alpha = 0.6,
linewidths = 0.7, edgecolors = 'red', label = 'Left')
plt.scatter(data.loc[data.Attrition==0,'Age'], data.loc[data.Attrition==0,'NumCompaniesWorked'], c = 'cyan', alpha = 0.6,
linewidths = 0.5, edgecolors = 'blue', label = 'Stayed')
plt.show()
plt.figure(figsize=(15,8))
sns.violinplot(x='NumCompaniesWorked', y='Age', hue='Attrition', data=data, split=True)
<AxesSubplot:xlabel='NumCompaniesWorked', ylabel='Age'>
sns.catplot(x='Age', y='Attrition', col='NumCompaniesWorked', kind='point', data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff723a49550>
# BeforeWorkingYears
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='Age', y='BeforeWorkingYears', hue='Attrition', data=data, ax=ax[0])
plt.scatter(data.loc[data.Attrition==1,'Age'], data.loc[data.Attrition==1,'BeforeWorkingYears'], c = 'pink', alpha = 0.6,
linewidths = 0.7, edgecolors = 'red', label = 'Left')
plt.scatter(data.loc[data.Attrition==0,'Age'], data.loc[data.Attrition==0,'BeforeWorkingYears'], c = 'cyan', alpha = 0.6,
linewidths = 0.5, edgecolors = 'blue', label = 'Stayed')
<matplotlib.collections.PathCollection at 0x7ff7235e5310>
plt.figure(figsize=(15,8))
plt.scatter(data['Age'], # x-axis
data['BeforeWorkingYears'], # y-axis
s = (data['Attrition']+1)*50, # marker size (larger for employees who left)
c = 'green', # color (fixed)
alpha = 0.3) # transparency
plt.xlabel('Age', size = 12)
plt.ylabel('BeforeWorkingYears', size = 12)
Text(0, 0.5, 'BeforeWorkingYears')
1: 'Below College' 2: 'College' 3: 'Bachelor' 4: 'Master' 5: 'Doctor'
# Check the distribution
sns.countplot(data=data, x='Education')
plt.show()
# Attrition by education level
sns.countplot(x='Education', hue='Attrition', data=data)
<AxesSubplot:xlabel='Education', ylabel='count'>
pd.crosstab(data.Attrition,
data.Education, margins=True).style.background_gradient(
cmap='summer_r')
| Education | 1 | 2 | 3 | 4 | 5 | All |
|---|---|---|---|---|---|---|
| Attrition | ||||||
| 0.0 | 139 | 238 | 473 | 340 | 43 | 1233 |
| 1.0 | 31 | 44 | 99 | 58 | 5 | 237 |
| All | 170 | 282 | 572 | 398 | 48 | 1470 |
sns.catplot(x='Education', y='Attrition', kind='point', data=data)
plt.show()
print('Attrition rate for Below College: ', data[data.Education==1].Attrition.mean())
print('Attrition rate for College: ', data[data.Education==2].Attrition.mean())
print('Attrition rate for Bachelor: ', data[data.Education==3].Attrition.mean())
print('Attrition rate for Master: ', data[data.Education==4].Attrition.mean())
print('Attrition rate for Doctor: ', data[data.Education==5].Attrition.mean())
Attrition rate for Below College:  0.18235294117647058
Attrition rate for College:  0.15602836879432624
Attrition rate for Bachelor:  0.17307692307692307
Attrition rate for Master:  0.1457286432160804
Attrition rate for Doctor:  0.10416666666666667
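The same rates can be read off a normalized crosstab, which keeps counts and rates in one consistent workflow. The sketch below rebuilds the data from the Education crosstab counts shown above (an assumption made so the example runs on its own) and uses `normalize="columns"`.

```python
import pandas as pd

# Counts per Education level taken from the crosstab above: (stayed, left).
edu_counts = {1: (139, 31), 2: (238, 44), 3: (473, 99), 4: (340, 58), 5: (43, 5)}

rows = []
for level, (stayed, left) in edu_counts.items():
    rows += [(level, 0.0)] * stayed + [(level, 1.0)] * left
toy = pd.DataFrame(rows, columns=["Education", "Attrition"])

# normalize="columns" divides each column by its total, so the
# Attrition == 1.0 row holds the attrition rate per education level.
rates = pd.crosstab(toy["Attrition"], toy["Education"], normalize="columns").loc[1.0]
print(rates.round(4))
```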
# Education Field
sns.catplot(x='Education', y='Attrition', col='EducationField', kind='point', data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff70034dd30>
# Gender
sns.catplot(x='Education', y='Attrition', col='Gender', kind='point', data=data)
sns.catplot(x='Gender', y='Attrition', col='Education', kind='point', data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff7116af040>
# MaritalStatus
sns.catplot(x='Education', y='Attrition', col='MaritalStatus', kind='point', data=data)
sns.catplot(x='MaritalStatus', y='Attrition', col='Education', kind='point', data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff710ded970>
# NumCompaniesWorked
sns.catplot(x='NumCompaniesWorked', y='Attrition', col='Education', kind='point', data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff710d207c0>
# Create BeforeWorkingYears_band
data['BeforeWorkingYears_band'] = 0
data.loc[data['BeforeWorkingYears'] <= 5, 'BeforeWorkingYears_band'] = 0
data.loc[(data['BeforeWorkingYears'] > 5) & (data['BeforeWorkingYears'] <= 10), 'BeforeWorkingYears_band'] = 1
data.loc[(data['BeforeWorkingYears'] > 10) & (data['BeforeWorkingYears'] <= 15), 'BeforeWorkingYears_band'] = 2
data.loc[(data['BeforeWorkingYears'] > 15) & (data['BeforeWorkingYears'] <= 20), 'BeforeWorkingYears_band'] = 3
data.loc[(data['BeforeWorkingYears'] > 20) & (data['BeforeWorkingYears'] <= 25), 'BeforeWorkingYears_band'] = 4
data.loc[data['BeforeWorkingYears'] > 25, 'BeforeWorkingYears_band'] = 5
data.head(5)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | Age_band | BeforeWorkingYears_band | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1.0 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 0 | 0 | 1 | 6 | 4 | 0 | 5 | 2 | 2 | 0 |
| 1 | 49 | 0.0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 2 | 3 | ... | 1 | 3 | 3 | 10 | 7 | 1 | 7 | 0 | 3 | 0 |
| 2 | 37 | 1.0 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | 4 | ... | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 7 | 2 | 1 |
| 3 | 33 | 0.0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 5 | 4 | ... | 0 | 3 | 3 | 8 | 7 | 3 | 0 | 0 | 1 | 0 |
| 4 | 27 | 0.0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 7 | 1 | ... | 1 | 3 | 3 | 2 | 2 | 2 | 2 | 4 | 1 | 0 |
5 rows × 35 columns
# BeforeWorkingYears
plt.figure(figsize=(15,8))
sns.countplot(x='BeforeWorkingYears_band', hue='Education', data=data)
<AxesSubplot:xlabel='BeforeWorkingYears_band', ylabel='count'>
sns.catplot(x='BeforeWorkingYears_band', y='Attrition', col='Education', kind='point', data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff7096639a0>
data.EducationField.unique()
array(['Life Sciences', 'Other', 'Medical', 'Marketing',
'Technical Degree', 'Human Resources'], dtype=object)
# Check the distribution
sns.countplot(data=data, x='EducationField')
plt.xticks(rotation=50)
plt.show()
# Attrition by education field
sns.countplot(x='EducationField', hue='Attrition', data=data)
plt.xticks(rotation=50)
plt.show()
pd.crosstab(data.Attrition,
data.EducationField, margins=True).style.background_gradient(
cmap='summer_r')
| EducationField | Human Resources | Life Sciences | Marketing | Medical | Other | Technical Degree | All |
|---|---|---|---|---|---|---|---|
| Attrition | |||||||
| 0.0 | 20 | 517 | 124 | 401 | 71 | 100 | 1233 |
| 1.0 | 7 | 89 | 35 | 63 | 11 | 32 | 237 |
| All | 27 | 606 | 159 | 464 | 82 | 132 | 1470 |
sns.catplot(x='EducationField', y='Attrition', kind='point', data=data)
plt.xticks(rotation=50)
plt.show()
print('Attrition rate for Human Resources: ', data[data.EducationField=='Human Resources'].Attrition.mean())
print('Attrition rate for Life Sciences: ', data[data.EducationField=='Life Sciences'].Attrition.mean())
print('Attrition rate for Marketing: ', data[data.EducationField=='Marketing'].Attrition.mean())
print('Attrition rate for Medical: ', data[data.EducationField=='Medical'].Attrition.mean())
print('Attrition rate for Technical Degree: ', data[data.EducationField=='Technical Degree'].Attrition.mean())
print('Attrition rate for Other: ', data[data.EducationField=='Other'].Attrition.mean())
Attrition rate for Human Resources:  0.25925925925925924
Attrition rate for Life Sciences:  0.14686468646864687
Attrition rate for Marketing:  0.22012578616352202
Attrition rate for Medical:  0.13577586206896552
Attrition rate for Technical Degree:  0.24242424242424243
Attrition rate for Other:  0.13414634146341464
# Gender
plt.figure(figsize=(15,8))
sns.countplot(x='EducationField', hue='Gender', data=data)
plt.xticks(rotation=50)
plt.show()
# Gender
# catplot is a figure-level function and creates its own figure, so no ax argument is passed
sns.catplot(x='Gender', y='Attrition', col='EducationField', kind='point', data=data)
sns.catplot(x='EducationField', y='Attrition', col='Gender', kind='point', data=data)
plt.xticks(rotation=50)
plt.show()
# NumCompaniesWorked
plt.figure(figsize=(15,8))
sns.countplot(x='NumCompaniesWorked', hue='EducationField', data=data)
plt.show()
sns.catplot(x='NumCompaniesWorked', y='Attrition', col='EducationField', kind='point', data=data)
sns.catplot(x='EducationField', y='Attrition', col='NumCompaniesWorked', kind='point', data=data)
plt.xticks(rotation=50)
plt.show()
# BeforeWorkingYears
plt.figure(figsize=(15,8))
sns.countplot(x='BeforeWorkingYears_band', hue='EducationField', data=data)
plt.show()
f, ax = plt.subplots(2, 1, figsize=(15, 10))
sns.violinplot(x='EducationField', y='BeforeWorkingYears', hue='Attrition', data=data, split=True, ax=ax[0])
sns.barplot(x='EducationField', y='BeforeWorkingYears', hue='Attrition', data=data, ax=ax[1])
plt.xticks(rotation=50)
plt.show()
# Check the distribution
sns.countplot(data=data, x='Gender')
plt.show()
# Attrition by gender
sns.countplot(x='Gender', hue='Attrition', data=data)
<AxesSubplot:xlabel='Gender', ylabel='count'>
pd.crosstab(data.Attrition,
data.Gender, margins=True).style.background_gradient(
cmap='summer_r')
| Gender | Female | Male | All |
|---|---|---|---|
| Attrition | |||
| 0.0 | 501 | 732 | 1233 |
| 1.0 | 87 | 150 | 237 |
| All | 588 | 882 | 1470 |
sns.catplot(x='Gender', y='Attrition', kind='point', data=data)
plt.xticks(rotation=50)
plt.show()
print('Attrition rate for men: ', data[data.Gender=='Male'].Attrition.mean())
print('Attrition rate for women: ', data[data.Gender=='Female'].Attrition.mean())
Attrition rate for men:  0.17006802721088435
Attrition rate for women:  0.14795918367346939
# MaritalStatus
plt.figure(figsize=(15,8))
sns.countplot(x='MaritalStatus', hue='Gender', data=data)
plt.show()
sns.catplot(x='MaritalStatus', y='Attrition', col='Gender', kind='point', data=data)
sns.catplot(x='Gender', y='Attrition', col='MaritalStatus', kind='point', data=data)
plt.show()
# NumCompaniesWorked
plt.figure(figsize=(15,8))
sns.countplot(x='NumCompaniesWorked', hue='Gender', data=data)
plt.show()
sns.catplot(x='NumCompaniesWorked', y='Attrition', col='Gender', kind='point', data=data)
sns.catplot(x='Gender', y='Attrition', col='NumCompaniesWorked', kind='point', data=data)
plt.show()
plt.figure(figsize=(15,8))
sns.lineplot(x='NumCompaniesWorked', y='Attrition', hue='Gender', data=data)
<AxesSubplot:xlabel='NumCompaniesWorked', ylabel='Attrition'>
# BeforeWorkingYears
plt.figure(figsize=(15,8))
sns.countplot(x='BeforeWorkingYears_band', hue='Gender', data=data)
plt.show()
sns.catplot(x='BeforeWorkingYears_band', y='Attrition', col='Gender', kind='point', data=data)
sns.catplot(x='Gender', y='Attrition', col='BeforeWorkingYears_band', kind='point', data=data)
plt.show()
plt.figure(figsize=(15,8))
sns.lineplot(x='BeforeWorkingYears', y='Attrition', hue='Gender', data=data)
<AxesSubplot:xlabel='BeforeWorkingYears', ylabel='Attrition'>
data.MaritalStatus.unique()
array(['Single', 'Married', 'Divorced'], dtype=object)
# Check the distribution
sns.countplot(data=data, x='MaritalStatus')
plt.show()
# Attrition by marital status
sns.countplot(x='MaritalStatus', hue='Attrition', data=data)
plt.show()
pd.crosstab(data.Attrition,
data.MaritalStatus, margins=True).style.background_gradient(
cmap='summer_r')
| MaritalStatus | Divorced | Married | Single | All |
|---|---|---|---|---|
| Attrition | ||||
| 0.0 | 294 | 589 | 350 | 1233 |
| 1.0 | 33 | 84 | 120 | 237 |
| All | 327 | 673 | 470 | 1470 |
sns.catplot(x='MaritalStatus', y='Attrition', kind='point', data=data)
plt.show()
print('Attrition rate for Single: ', data[data.MaritalStatus=='Single'].Attrition.mean())
print('Attrition rate for Married: ', data[data.MaritalStatus=='Married'].Attrition.mean())
print('Attrition rate for Divorced: ', data[data.MaritalStatus=='Divorced'].Attrition.mean())
Attrition rate for Single:  0.2553191489361702
Attrition rate for Married:  0.12481426448736999
Attrition rate for Divorced:  0.10091743119266056
# NumCompaniesWorked
plt.figure(figsize=(15,8))
sns.countplot(x='NumCompaniesWorked', hue='MaritalStatus', data=data)
plt.show()
sns.catplot(x='NumCompaniesWorked', y='Attrition', col='MaritalStatus', kind='point', data=data)
sns.catplot(x='MaritalStatus', y='Attrition', col='NumCompaniesWorked', kind='point', data=data)
plt.show()
plt.figure(figsize=(15,8))
sns.lineplot(x='NumCompaniesWorked', y='Attrition', hue='MaritalStatus', data=data)
<AxesSubplot:xlabel='NumCompaniesWorked', ylabel='Attrition'>
# BeforeWorkingYears
plt.figure(figsize=(15,8))
sns.countplot(x='BeforeWorkingYears_band', hue='MaritalStatus', data=data)
plt.show()
sns.catplot(x='BeforeWorkingYears_band', y='Attrition', col='MaritalStatus', kind='point', data=data)
sns.catplot(x='MaritalStatus', y='Attrition', col='BeforeWorkingYears_band', kind='point', data=data)
plt.show()
print(data.MaritalStatus.value_counts())
print(data.loc[(data.MaritalStatus=='Divorced')&(data.BeforeWorkingYears_band==3), 'BeforeWorkingYears_band'].count())
Married     673
Single      470
Divorced    327
Name: MaritalStatus, dtype: int64
15
plt.figure(figsize=(15,8))
sns.lineplot(x='BeforeWorkingYears', y='Attrition', hue='MaritalStatus', data=data)
<AxesSubplot:xlabel='BeforeWorkingYears', ylabel='Attrition'>
# Check the distribution
sns.countplot(data=data, x='NumCompaniesWorked')
plt.show()
# Attrition by number of companies worked at
sns.countplot(x='NumCompaniesWorked', hue='Attrition', data=data)
<AxesSubplot:xlabel='NumCompaniesWorked', ylabel='count'>
sns.distplot(data[data['Attrition'] == 1].NumCompaniesWorked, color='r')
sns.distplot(data[data['Attrition'] == 0].NumCompaniesWorked, color='b')
<AxesSubplot:xlabel='NumCompaniesWorked', ylabel='Density'>
pd.crosstab(data.Attrition,
data.NumCompaniesWorked, margins=True).style.background_gradient(
cmap='summer_r')
| NumCompaniesWorked | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition | |||||||||||
| 0.0 | 174 | 423 | 130 | 143 | 122 | 47 | 54 | 57 | 43 | 40 | 1233 |
| 1.0 | 23 | 98 | 16 | 16 | 17 | 16 | 16 | 17 | 6 | 12 | 237 |
| All | 197 | 521 | 146 | 159 | 139 | 63 | 70 | 74 | 49 | 52 | 1470 |
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data)
plt.show()
print('Attrition rate for 0 companies: ', data[data.NumCompaniesWorked==0].Attrition.mean())
print('Attrition rate for 1 company: ', data[data.NumCompaniesWorked==1].Attrition.mean())
print('Attrition rate for 2 companies: ', data[data.NumCompaniesWorked==2].Attrition.mean())
print('Attrition rate for 3 companies: ', data[data.NumCompaniesWorked==3].Attrition.mean())
print('Attrition rate for 4 companies: ', data[data.NumCompaniesWorked==4].Attrition.mean())
print('Attrition rate for 5 companies: ', data[data.NumCompaniesWorked==5].Attrition.mean())
print('Attrition rate for 6 companies: ', data[data.NumCompaniesWorked==6].Attrition.mean())
print('Attrition rate for 7 companies: ', data[data.NumCompaniesWorked==7].Attrition.mean())
print('Attrition rate for 8 companies: ', data[data.NumCompaniesWorked==8].Attrition.mean())
print('Attrition rate for 9 companies: ', data[data.NumCompaniesWorked==9].Attrition.mean())
Attrition rate for 0 companies:  0.116751269035533
Attrition rate for 1 company:  0.18809980806142035
Attrition rate for 2 companies:  0.1095890410958904
Attrition rate for 3 companies:  0.10062893081761007
Attrition rate for 4 companies:  0.1223021582733813
Attrition rate for 5 companies:  0.25396825396825395
Attrition rate for 6 companies:  0.22857142857142856
Attrition rate for 7 companies:  0.22972972972972974
Attrition rate for 8 companies:  0.12244897959183673
Attrition rate for 9 companies:  0.23076923076923078
data.BeforeWorkingYears.describe()
count    1470.000000
mean        4.271429
std         6.179783
min         0.000000
25%         0.000000
50%         2.000000
75%         5.000000
max        33.000000
Name: BeforeWorkingYears, dtype: float64
np.sort(data.BeforeWorkingYears.unique())
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33])
# Check the distribution
sns.countplot(data=data, x='BeforeWorkingYears')
plt.show()
# Create BeforeWorkingYears_band
data['BeforeWorkingYears_band'] = 0
data.loc[data['BeforeWorkingYears'] <= 5, 'BeforeWorkingYears_band'] = 0
data.loc[(data['BeforeWorkingYears'] > 5) & (data['BeforeWorkingYears'] <= 10), 'BeforeWorkingYears_band'] = 1
data.loc[(data['BeforeWorkingYears'] > 10) & (data['BeforeWorkingYears'] <= 15), 'BeforeWorkingYears_band'] = 2
data.loc[(data['BeforeWorkingYears'] > 15) & (data['BeforeWorkingYears'] <= 20), 'BeforeWorkingYears_band'] = 3
data.loc[(data['BeforeWorkingYears'] > 20) & (data['BeforeWorkingYears'] <= 25), 'BeforeWorkingYears_band'] = 4
data.loc[data['BeforeWorkingYears'] > 25, 'BeforeWorkingYears_band'] = 5
data.head(5)
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | StockOptionLevel | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | Age_band | BeforeWorkingYears_band | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 1.0 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 2 | ... | 0 | 0 | 1 | 6 | 4 | 0 | 5 | 2 | 2 | 0 |
| 1 | 49 | 0.0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 2 | 3 | ... | 1 | 3 | 3 | 10 | 7 | 1 | 7 | 0 | 3 | 0 |
| 2 | 37 | 1.0 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | 4 | ... | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 7 | 2 | 1 |
| 3 | 33 | 0.0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 5 | 4 | ... | 0 | 3 | 3 | 8 | 7 | 3 | 0 | 0 | 1 | 0 |
| 4 | 27 | 0.0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 7 | 1 | ... | 1 | 3 | 3 | 2 | 2 | 2 | 2 | 4 | 1 | 0 |
5 rows × 35 columns
# Check the distribution
sns.countplot(data=data, x='BeforeWorkingYears_band')
plt.show()
# Attrition by prior work experience
sns.countplot(x='BeforeWorkingYears_band', hue='Attrition', data=data)
<AxesSubplot:xlabel='BeforeWorkingYears_band', ylabel='count'>
pd.crosstab(data.Attrition,
data.BeforeWorkingYears_band, margins=True).style.background_gradient(
cmap='summer_r')
| BeforeWorkingYears_band | 0 | 1 | 2 | 3 | 4 | 5 | All |
|---|---|---|---|---|---|---|---|
| Attrition | |||||||
| 0.0 | 906 | 153 | 66 | 50 | 36 | 22 | 1233 |
| 1.0 | 198 | 20 | 8 | 7 | 4 | 0 | 237 |
| All | 1104 | 173 | 74 | 57 | 40 | 22 | 1470 |
sns.lineplot(x='BeforeWorkingYears', y='Attrition', data=data)
plt.show()
sns.catplot(x='BeforeWorkingYears_band', y='Attrition', kind='point', data=data)
plt.show()
print('Attrition rate for BeforeWorkingYears_band 0: ', data[data.BeforeWorkingYears_band==0].Attrition.mean())
print('Attrition rate for BeforeWorkingYears_band 1: ', data[data.BeforeWorkingYears_band==1].Attrition.mean())
print('Attrition rate for BeforeWorkingYears_band 2: ', data[data.BeforeWorkingYears_band==2].Attrition.mean())
print('Attrition rate for BeforeWorkingYears_band 3: ', data[data.BeforeWorkingYears_band==3].Attrition.mean())
print('Attrition rate for BeforeWorkingYears_band 4: ', data[data.BeforeWorkingYears_band==4].Attrition.mean())
print('Attrition rate for BeforeWorkingYears_band 5: ', data[data.BeforeWorkingYears_band==5].Attrition.mean())
Attrition rate for BeforeWorkingYears_band 0:  0.1793478260869565
Attrition rate for BeforeWorkingYears_band 1:  0.11560693641618497
Attrition rate for BeforeWorkingYears_band 2:  0.10810810810810811
Attrition rate for BeforeWorkingYears_band 3:  0.12280701754385964
Attrition rate for BeforeWorkingYears_band 4:  0.1
Attrition rate for BeforeWorkingYears_band 5:  0.0
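Guideline: DailyRate, HourlyRate, MonthlyRate, and MonthlyIncome could simply be merged into a single variable, or only MonthlyIncome and MonthlyRate could be used. MonthlyRate, HourlyRate, and DailyRate share the same distribution shape. After scaling the four variables to the same mean, they can be combined into one measure of a person's pay level.
Because all of these variables are integers, they will be binned into groups for clearer visualization. The group with a high salary hike rate can be regarded as the key talent group.
Among the group with a high salary hike rate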
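The idea of putting the four pay variables on the same mean and then combining them can be sketched as follows. The helper name `pay_index`, the choice of a plain average, and the toy values are assumptions made for illustration; the guideline does not fix how the scaled variables are combined.

```python
import pandas as pd

def pay_index(df, cols=("DailyRate", "HourlyRate", "MonthlyRate", "MonthlyIncome")):
    """Hypothetical composite pay level: scale each column to mean 1, then average."""
    scaled = df[list(cols)] / df[list(cols)].mean()
    return scaled.mean(axis=1)

# Toy data for illustration only (not taken from the HR dataset).
toy = pd.DataFrame({
    "DailyRate":     [400, 800, 1200],
    "HourlyRate":    [40, 60, 80],
    "MonthlyRate":   [10000, 15000, 20000],
    "MonthlyIncome": [3000, 5000, 10000],
})
toy["PayIndex"] = pay_index(toy)
print(toy["PayIndex"].round(3))
```

Dividing by the column mean gives every variable equal weight in the index; whether that is appropriate depends on how informative each pay variable turns out to be.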
# Check the Rate variables and MonthlyIncome
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# col = ['DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome', 'MonthlyRate'
# , 'PercentSalaryHike', 'StockOptionLevel']
# data.info()
sns.distplot(data['DailyRate']/np.mean(data['DailyRate']))
sns.distplot(data['HourlyRate']/np.mean(data['HourlyRate']))
sns.distplot(data['MonthlyRate']/np.mean(data['MonthlyRate']))
sns.distplot(data['MonthlyIncome']/np.mean(data['MonthlyIncome']))
# Only HourlyRate has a slightly narrower distribution
# DailyRate and MonthlyRate have the same distribution
print("\n")
print(data.groupby('Attrition')[['HourlyRate']].agg(['count', 'max', 'min', 'mean', 'median']))
print("\n")
print(data.groupby('Attrition')[['MonthlyRate']].agg(['count', 'max', 'min', 'mean', 'median']))
print("\n")
print(data.groupby('Attrition')[['MonthlyIncome']].agg(['count', 'max', 'min', 'mean', 'median']))
# MonthlyIncome and Attrition are negatively correlated
HourlyRate
count max min mean median
Attrition
0.0 1233 100 30 65.952149 66
1.0 237 100 31 65.573840 66
MonthlyRate
count max min mean median
Attrition
0.0 1233 26997 2094 14265.779400 14120
1.0 237 26999 2326 14559.308017 14618
MonthlyIncome
count max min mean median
Attrition
0.0 1233 19999 1051 6832.739659 5204
1.0 237 19859 1009 4787.092827 3202
Of the three rate variables, we plan to bin and use only MonthlyRate.
MonthlyIncome appears to be a meaningful variable.
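One way to put a number on the negative relationship between income and attrition is the point-biserial correlation, which is simply the Pearson correlation when one variable is 0/1. The sketch below uses made-up values, not the HR data.

```python
import numpy as np

# Hypothetical incomes and 0/1 attrition flags, for illustration only
income    = np.array([2000, 2500, 3000, 5000, 7000, 9000, 12000, 15000])
attrition = np.array([1,    1,    0,    1,    0,    0,    0,     0])

# Pearson correlation with a 0/1 variable is the point-biserial correlation
r = np.corrcoef(income, attrition)[0, 1]
print(round(r, 3))
```

On the notebook's frame the equivalent would be `data['MonthlyIncome'].corr(data['Attrition'])`, assuming `Attrition` is numeric.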
# Salary hike rate (PercentSalaryHike)
sns.distplot(data['PercentSalaryHike'])
print(data.groupby('Attrition')[['PercentSalaryHike']].agg(['count', 'max', 'min', 'mean', 'median']))
sns.factorplot('Attrition', 'PercentSalaryHike', data=data)
# PercentSalaryHike does not seem to have much effect on Attrition
# It should probably be examined together with other variables
PercentSalaryHike
count max min mean median
Attrition
0.0 1233 25 11 15.231144 14
1.0 237 25 11 15.097046 14
<seaborn.axisgrid.FacetGrid at 0x7ff709a77460>
# DistanceFromHome
sns.distplot(data['DistanceFromHome'])
print(data.groupby('Attrition')[['DistanceFromHome']].agg(['count', 'max', 'min', 'mean', 'median']))
sns.factorplot('Attrition', 'DistanceFromHome', data=data)
# DistanceFromHome and Attrition are positively correlated
DistanceFromHome
count max min mean median
Attrition
0.0 1233 29 1 8.915653 7
1.0 237 29 1 10.632911 9
<seaborn.axisgrid.FacetGrid at 0x7ff709c296a0>
#StockOptionLevel
sns.distplot(data['StockOptionLevel'])
print(data.groupby('Attrition')[['StockOptionLevel']].agg(['count', 'max', 'min', 'mean', 'median']))
sns.factorplot('Attrition', 'StockOptionLevel', data=data)
# Attrition and StockOptionLevel are negatively correlated
StockOptionLevel
count max min mean median
Attrition
0.0 1233 3 0 0.845093 1
1.0 237 3 0 0.527426 0
<seaborn.axisgrid.FacetGrid at 0x7ff70a6ef430>
# Convert continuous variables into categorical (quartile) bins
data['Rate_range'] = pd.qcut(data['MonthlyRate'], q=4, labels=[1,2,3,4])
data['Income_range'] = pd.qcut(data['MonthlyIncome'], q=4,labels=[1,2,3,4])
data['SalaryHike_range'] = pd.qcut(data['PercentSalaryHike'], q=4,labels=[1,2,3,4])
data['HomeDistance_range'] = pd.qcut(data['DistanceFromHome'], q=4, labels=[1,2,3,4])
data['Rate_range']=data['Rate_range'].astype('int')
data['Income_range']=data['Income_range'].astype('int')
data['SalaryHike_range']=data['SalaryHike_range'].astype('int')
data['HomeDistance_range']=data['HomeDistance_range'].astype('int')
data_talent = data[data['SalaryHike_range']==4]
data_normal = data[data['SalaryHike_range'].isin([1,2,3])]
# First, check whether the high salary hike is simply because these are new hires
f, ax = plt.subplots(1,2, figsize=(20,10))
sns.distplot(data_talent['YearsAtCompany'],ax=ax[0])
sns.distplot(data_normal['YearsAtCompany'],ax=ax[1])
print("data_talent mean YearsAtCompany :", data_talent['YearsAtCompany'].mean())
print("data_normal mean YearsAtCompany :", data_normal['YearsAtCompany'].mean())
print("\n")
print(data_talent.groupby('Attrition')[['YearsAtCompany']].agg(['count', 'max', 'min', 'mean', 'median']))
print("\n")
print(data_normal.groupby('Attrition')[['YearsAtCompany']].agg(['count', 'max', 'min', 'mean', 'median']))
data_talent mean YearsAtCompany : 6.824503311258278
data_normal mean YearsAtCompany : 7.055650684931507
YearsAtCompany
count max min mean median
Attrition
0.0 256 36 0 7.316406 5.5
1.0 46 15 0 4.086957 3.0
YearsAtCompany
count max min mean median
Attrition
0.0 977 37 0 7.382805 6
1.0 191 40 0 5.382199 4
# The attrition rates themselves are similar.
f, ax = plt.subplots(1,2, figsize=(20,20))
data_talent['Attrition'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],shadow=True, fontsize=20)
data_normal['Attrition'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1],shadow=True, fontsize=20)
<AxesSubplot:ylabel='Attrition'>
data_talent["OverTime"].unique()
array(['No', 'Yes'], dtype=object)
data_talent["OverTime"].replace("Yes", 1, inplace=True)
data_talent["OverTime"].replace("No", 0, inplace=True)
data_normal["OverTime"].replace("Yes", 1, inplace=True)
data_normal["OverTime"].replace("No", 0, inplace=True)
data_talent['OverTime'].unique()
array([0, 1])
sns.factorplot('Attrition','OverTime',data=data_normal)
plt.show()
sns.factorplot('Attrition','OverTime',data=data_talent)
plt.show()
f, ax = plt.subplots(1,2, figsize=(20,10))
sns.distplot(data_talent[data_talent['Attrition']==1]['OverTime'],ax=ax[0])
sns.distplot(data_talent[data_talent['Attrition']==0]['OverTime'],ax=ax[0])
sns.distplot(data_normal[data_normal['Attrition']==1]['OverTime'],ax=ax[1])
sns.distplot(data_normal[data_normal['Attrition']==0]['OverTime'],ax=ax[1])
print(data_talent.groupby('Attrition')[['OverTime']].agg(['count', 'max', 'min', 'mean', 'median']))
print(data_normal.groupby('Attrition')[['OverTime']].agg(['count', 'max', 'min', 'mean', 'median']))
OverTime
count max min mean median
Attrition
0.0 256 1 0 0.222656 0
1.0 46 1 0 0.630435 1
OverTime
count max min mean median
Attrition
0.0 977 1 0 0.237462 0
1.0 191 1 0 0.513089 1
data_talent['StockOptionLevel'].unique()
array([1, 3, 0, 2])
f, ax = plt.subplots(1,2, figsize=(20,10))
sns.distplot(data_talent[data_talent['Attrition']==1]['StockOptionLevel'],ax=ax[0])
sns.distplot(data_talent[data_talent['Attrition']==0]['StockOptionLevel'],ax=ax[0])
sns.distplot(data_normal[data_normal['Attrition']==1]['StockOptionLevel'],ax=ax[1])
sns.distplot(data_normal[data_normal['Attrition']==0]['StockOptionLevel'],ax=ax[1])
print(data_talent.groupby('Attrition')[['StockOptionLevel']].agg(['count', 'max', 'min', 'mean', 'median'])) #0.42
print(data_normal.groupby('Attrition')[['StockOptionLevel']].agg(['count', 'max', 'min', 'mean', 'median'])) #0.31
StockOptionLevel
count max min mean median
Attrition
0.0 256 3 0 0.871094 1
1.0 46 3 0 0.434783 0
StockOptionLevel
count max min mean median
Attrition
0.0 977 3 0 0.838280 1
1.0 191 3 0 0.549738 0
data_talent['WorkLifeBalance'].unique()
array([3, 2, 1, 4])
f, ax = plt.subplots(1,2, figsize=(20,10))
sns.distplot(data_talent[data_talent['Attrition']==1]['WorkLifeBalance'],ax=ax[0])
sns.distplot(data_talent[data_talent['Attrition']==0]['WorkLifeBalance'],ax=ax[0])
sns.distplot(data_normal[data_normal['Attrition']==1]['WorkLifeBalance'],ax=ax[1])
sns.distplot(data_normal[data_normal['Attrition']==0]['WorkLifeBalance'],ax=ax[1])
print(data_talent.groupby('Attrition')[['WorkLifeBalance']].agg(['count', 'max', 'min', 'mean', 'median']))
print(data_normal.groupby('Attrition')[['WorkLifeBalance']].agg(['count', 'max', 'min', 'mean', 'median']))
WorkLifeBalance
count max min mean median
Attrition
0.0 256 4 1 2.777344 3
1.0 46 4 1 2.739130 3
WorkLifeBalance
count max min mean median
Attrition
0.0 977 4 1 2.781986 3
1.0 191 4 1 2.638743 3
data_talent['HomeDistance_range']= data_talent['HomeDistance_range'].astype(int)
data_normal['HomeDistance_range']= data_normal['HomeDistance_range'].astype(int)
print(data_talent.groupby('Attrition')[['HomeDistance_range']].agg(['count', 'max', 'min', 'mean', 'median']))
print('\n')
print(data_normal.groupby('Attrition')[['HomeDistance_range']].agg(['count', 'max', 'min', 'mean', 'median']))
f, ax = plt.subplots(1,2, figsize=(20,10))
sns.distplot(data_talent[data_talent['Attrition']==1]['HomeDistance_range'],ax=ax[0])
sns.distplot(data_talent[data_talent['Attrition']==0]['HomeDistance_range'],ax=ax[0])
sns.distplot(data_normal[data_normal['Attrition']==1]['HomeDistance_range'],ax=ax[1])
sns.distplot(data_normal[data_normal['Attrition']==0]['HomeDistance_range'],ax=ax[1])
HomeDistance_range
count max min mean median
Attrition
0.0 256 4 1 2.351562 2
1.0 46 4 1 2.869565 3
HomeDistance_range
count max min mean median
Attrition
0.0 977 4 1 2.399181 2
1.0 191 4 1 2.581152 3
<AxesSubplot:xlabel='HomeDistance_range', ylabel='Density'>
# Same approach for the MonthlyIncome range
data_talent['Income_range']= data_talent['Income_range'].astype(int)
data_normal['Income_range']= data_normal['Income_range'].astype(int)
print(data_talent.groupby('Attrition')[['Income_range']].agg(['count', 'max', 'min', 'mean', 'median']))
print('\n')
print(data_normal.groupby('Attrition')[['Income_range']].agg(['count', 'max', 'min', 'mean', 'median']))
f, ax = plt.subplots(1,2, figsize=(20,10))
sns.distplot(data_talent[data_talent['Attrition']==1]['Income_range'],ax=ax[0])
sns.distplot(data_talent[data_talent['Attrition']==0]['Income_range'],ax=ax[0])
sns.distplot(data_normal[data_normal['Attrition']==1]['Income_range'],ax=ax[1])
sns.distplot(data_normal[data_normal['Attrition']==0]['Income_range'],ax=ax[1])
Income_range
count max min mean median
Attrition
0.0 256 4 1 2.570312 3
1.0 46 4 1 1.739130 1
Income_range
count max min mean median
Attrition
0.0 977 4 1 2.594678 3
1.0 191 4 1 2.099476 2
<AxesSubplot:xlabel='Income_range', ylabel='Density'>
data.groupby(['Department','Attrition'])['Attrition'].count().to_frame()
| Attrition | ||
|---|---|---|
| Department | Attrition | |
| Human Resources | 0.0 | 51 |
| 1.0 | 12 | |
| Research & Development | 0.0 | 828 |
| 1.0 | 133 | |
| Sales | 0.0 | 354 |
| 1.0 | 92 |
data.groupby(['Department','Gender'])['Gender'].count().to_frame()
| Gender | ||
|---|---|---|
| Department | Gender | |
| Human Resources | Female | 20 |
| Male | 43 | |
| Research & Development | Female | 379 |
| Male | 582 | |
| Sales | Female | 189 |
| Male | 257 |
f, ax = plt.subplots(1,2, figsize=(30,15))
data["Department"].value_counts().plot.bar(ax=ax[0],
color=['b','y','g'])
ax[0].set_title("Each Department Number")
sns.countplot('Department', hue="Attrition", data=data, ax=ax[1])
ax[1].set_title("Each Attrition of Departments")
plt.show()
sns.factorplot('Department','Attrition', data=data)
plt.show()
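Human Resources = 19.05% (12/63)
Research & Development = 13.84% (133/961)
Sales = 20.63% (92/446)
Research & Development is about 2.3 percentage points below the overall attrition rate (237/1470 = 16.12%),
while Human Resources and Sales are about 2.9 and 4.5 percentage points above it, respectively.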
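The department rates can be recomputed directly from the Department and Attrition counts shown in the table above. This is a minimal sketch; the column names `stayed`, `left`, `rate_pct`, and `diff_vs_overall_pp` are labels introduced here for illustration.

```python
import pandas as pd

# Counts copied from the Department x Attrition table above
counts = pd.DataFrame(
    {"stayed": [51, 828, 354], "left": [12, 133, 92]},
    index=["Human Resources", "Research & Development", "Sales"],
)

total = counts["stayed"] + counts["left"]
rate = counts["left"] / total                  # attrition rate per department
overall = counts["left"].sum() / total.sum()   # company-wide attrition rate

summary = pd.DataFrame({
    "rate_pct": (rate * 100).round(2),
    "diff_vs_overall_pp": ((rate - overall) * 100).round(2),
})
print(summary)
```

Reporting the gap in percentage points avoids the ambiguity of saying a rate is "3% lower", which could also be read as a relative change.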
data["BusinessTravel"].value_counts().to_frame()
| BusinessTravel | |
|---|---|
| Travel_Rarely | 1043 |
| Travel_Frequently | 277 |
| Non-Travel | 150 |
data.groupby(['BusinessTravel','Attrition'])['Attrition'].count().to_frame()
| Attrition | ||
|---|---|---|
| BusinessTravel | Attrition | |
| Non-Travel | 0.0 | 138 |
| 1.0 | 12 | |
| Travel_Frequently | 0.0 | 208 |
| 1.0 | 69 | |
| Travel_Rarely | 0.0 | 887 |
| 1.0 | 156 |
data.groupby(['BusinessTravel','Department'])['Department'].count().to_frame()
| Department | ||
|---|---|---|
| BusinessTravel | Department | |
| Non-Travel | Human Resources | 6 |
| Research & Development | 97 | |
| Sales | 47 | |
| Travel_Frequently | Human Resources | 11 |
| Research & Development | 182 | |
| Sales | 84 | |
| Travel_Rarely | Human Resources | 46 |
| Research & Development | 682 | |
| Sales | 315 |
f, ax = plt.subplots(1,3,figsize=(30,10))
data['BusinessTravel'].value_counts().plot.bar(ax=ax[0], color=['g','y','b'])
ax[0].set_title("BusinessTravel number")
sns.countplot('BusinessTravel',hue='Attrition', data=data, ax=ax[1])
ax[1].set_title("BusinessTravel & Attrition")
sns.countplot('BusinessTravel',hue='Department', data=data, ax=ax[2])
ax[2].set_title("BusinessTravel & Department")
sns.factorplot('BusinessTravel', 'Attrition', data=data)
plt.show()
# Share of each travel category within a department (0: Non-Travel, 1: Travel_Rarely, 2: Travel_Frequently)
ResearchDevelopment0=97/961
ResearchDevelopment1=682/961
ResearchDevelopment2=182/961  # Travel_Frequently count in R&D is 182, not 97
Sales0=47/446
Sales1=315/446
Sales2=84/446
HumanResource0=6/63
HumanResource1=46/63
HumanResource2=11/63
print("Research&Development: ", round(ResearchDevelopment0, 4),
round(ResearchDevelopment1, 4), round(ResearchDevelopment2, 4))
print("Sales: ", round(Sales0, 4), round(Sales1, 4), round(Sales2, 4))
print("Human Resource:", round(HumanResource0, 4), round(HumanResource1, 4),
round(HumanResource2, 4))
Research&Development: 0.1009 0.7097 0.1894
Sales: 0.1054 0.7063 0.1883
Human Resource: 0.0952 0.7302 0.1746
Employees who travel frequently have the highest attrition rate at about 25% (69/277), while those who never travel have an attrition rate below 10% (12/150 = 8%).
By department, frequent travelers make up 18.8% of Sales, 17.46% of Human Resources, and 18.94% of R&D (182/961), so the share of frequent travelers is similar across departments.
# 업무별로 출장회수에 따른 퇴직자의 비율
sns.factorplot('BusinessTravel', 'Attrition', hue='Department', data=data)
plt.show()
Sales and Human Resources show a meaningful difference.
Among frequent travelers the attrition rate is around 30%, roughly twice the average.
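The rates read off the point plot can also be tabulated. The sketch below uses a small hypothetical frame with the notebook's column names and assumes `Attrition` is coded 0/1, so that the mean within each cell is the attrition rate.

```python
import pandas as pd

# Hypothetical rows; the column names match the notebook, the values do not
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Frequently", "Travel_Frequently", "Travel_Rarely",
                       "Travel_Rarely", "Non-Travel", "Non-Travel"] * 2,
    "Department":     ["Sales"] * 6 + ["Human Resources"] * 6,
    "Attrition":      [1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
})

# Mean of the 0/1 Attrition flag per (travel category, department) cell
rate_table = toy.pivot_table(index="BusinessTravel", columns="Department",
                             values="Attrition", aggfunc="mean")
print(rate_table)
```

With the real data, cells for Human Resources rest on very few employees (for example 11 frequent travelers), so their rates should be read with caution.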
data['JobInvolvement'].value_counts().to_frame()
| JobInvolvement | |
|---|---|
| 3 | 868 |
| 2 | 375 |
| 4 | 144 |
| 1 | 83 |
f, ax = plt.subplots(1,3,figsize=(30,10))
sns.countplot('JobInvolvement', hue="Attrition", data=data,ax=ax[0])
ax[0].set_title('JobInvolvement & Attrition')
sns.countplot('JobInvolvement', hue="Department", data=data,ax=ax[1])
ax[1].set_title('JobInvolvement & Department')
sns.countplot('JobInvolvement', hue="BusinessTravel", data=data,ax=ax[2])
ax[2].set_title('JobInvolvement & BusinessTravel') # 비율이 일정.
Text(0.5, 1.0, 'JobInvolvement & BusinessTravel')
sns.factorplot('JobInvolvement', 'Attrition', data=data)
plt.show()
sns.factorplot('JobInvolvement', 'Attrition',col="Department", data=data)
plt.show()
The difference is most pronounced in Sales. In general, the lower the job involvement, the further the attrition rate rises above the 16% average.
sns.factorplot('JobInvolvement', 'Attrition',col="BusinessTravel", data=data)
plt.show()
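When travel is rare, the lower the job involvement, the higher the share of leavers.
When travel is frequent, the lower the job involvement, the higher the share of leavers.
When there is no travel, there is no meaningful difference.
Overall, the attrition rate increases as job involvement falls and travel frequency rises.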
data["JobLevel"].value_counts().to_frame()
| JobLevel | |
|---|---|
| 1 | 543 |
| 2 | 534 |
| 3 | 218 |
| 4 | 106 |
| 5 | 69 |
f, ax = plt.subplots(1,2,figsize=(30,10))
data[['Attrition','JobLevel']].groupby(['JobLevel']).mean().plot.barh(ax=ax[0])
ax[0].set_title('JobLevel, Attrition')
sns.countplot('JobLevel', hue="Attrition", data=data, ax=ax[1])
ax[1].set_title("JobLevel, Attrition")
plt.show()
sns.factorplot('JobLevel', 'Attrition', data=data)
plt.show()
At low JobLevel, the attrition rate differs markedly.
data.groupby(['BusinessTravel','JobLevel'])['JobLevel'].count().to_frame()
| JobLevel | ||
|---|---|---|
| BusinessTravel | JobLevel | |
| Non-Travel | 1 | 48 |
| 2 | 67 | |
| 3 | 20 | |
| 4 | 11 | |
| 5 | 4 | |
| Travel_Frequently | 1 | 104 |
| 2 | 104 | |
| 3 | 40 | |
| 4 | 19 | |
| 5 | 10 | |
| Travel_Rarely | 1 | 391 |
| 2 | 363 | |
| 3 | 158 | |
| 4 | 76 | |
| 5 | 55 |
f, ax = plt.subplots(1,2,figsize=(30,10))
sns.countplot('JobLevel', hue="BusinessTravel", data=data, ax=ax[0])
ax[0].set_title("JobLevel, BusinessTravel")
sns.countplot('JobLevel', hue="Department", data=data, ax=ax[1])
ax[1].set_title("JobLevel, Department")
plt.show()
sns.factorplot('JobLevel', 'Attrition',col="Department", data=data)
plt.show()
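In every department the attrition rate is highest at JobLevel 1; notably, in Human Resources, JobLevel 3 shows an attrition rate similar to that of JobLevel 1.
Travel frequency follows a similar pattern across all departments and job levels.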
data["JobRole"].value_counts().to_frame()
| JobRole | |
|---|---|
| Sales Executive | 326 |
| Research Scientist | 292 |
| Laboratory Technician | 259 |
| Manufacturing Director | 145 |
| Healthcare Representative | 131 |
| Manager | 102 |
| Sales Representative | 83 |
| Research Director | 80 |
| Human Resources | 52 |
data.groupby(['JobRole','Department'])['Department'].count().to_frame()
| Department | ||
|---|---|---|
| JobRole | Department | |
| Healthcare Representative | Research & Development | 131 |
| Human Resources | Human Resources | 52 |
| Laboratory Technician | Research & Development | 259 |
| Manager | Human Resources | 11 |
| Research & Development | 54 | |
| Sales | 37 | |
| Manufacturing Director | Research & Development | 145 |
| Research Director | Research & Development | 80 |
| Research Scientist | Research & Development | 292 |
| Sales Executive | Sales | 326 |
| Sales Representative | Sales | 83 |
Sales=[Sales Executive, Sales Representative, Manager ]
Research & Development =[Research Director,Research Scientist, Laboratory Technician, Manufacturing Director ,
Healthcare Representative ,Manager ]
Human Resources =[Human Resources ,Manager]
data.groupby(['JobRole','JobLevel'])['JobLevel'].value_counts().plot.barh()
plt.yticks(size=10)
plt.show()
data.groupby(['JobRole','Department'])['Department'].value_counts().plot.barh()
plt.yticks(size=10)
plt.show()
f, ax = plt.subplots(1,3,figsize=(30,10))
data[data['Department']=="Research & Development"]['JobRole'].value_counts().plot.barh(ax=ax[0])
ax[0].set_title("Research & Development's JobRole")
data[data['Department']=="Sales"]['JobRole'].value_counts().plot.barh(ax=ax[1])
ax[1].set_title("sale's JobRole")
data[data['Department']=="Human Resources"]['JobRole'].value_counts().plot.barh(ax=ax[2])
ax[2].set_title("Human Resources's JobRole")
Text(0.5, 1.0, "Human Resources's JobRole")
data[['Attrition','JobRole']].groupby(['JobRole']).mean().plot.barh()
plt.title("Attrition by JobRole")
plt.yticks(size=15)
plt.show()
Laboratory Technician
Sales Representative
Human Resources
Within each department, the roles that correspond to JobLevel 1 (the three listed above) had attrition rates above 20%, and Sales Representative approached 40% (91.5% of Sales Representatives are at JobLevel 1).
Sales=[Sales Executive(2,3,4), Sales Representative(1,2), Manager(3,4,5) ]
Research & Development =[Research Director(3,4,5), Research Scientist(1,2,3), Laboratory Technician(1,2,3),
Manufacturing Director(2,3,4) ,Healthcare Representative(2,3,4) ,Manager(3,4,5) ]
Human Resources =[Human Resources(1,2,3) ,Manager(3,4,5)]
JobLevel and JobRole both appear to be strongly associated with Attrition.
data["OverTime"].value_counts()
No 1054 Yes 416 Name: OverTime, dtype: int64
f, ax = plt.subplots(1,3,figsize=(30,10))
data[['Attrition','OverTime']].groupby(['OverTime']).mean().plot.barh(ax=ax[0])
ax[0].set_title("Attrition & OverTime")
sns.countplot("OverTime", hue="Department", data=data, ax=ax[1])
sns.countplot("OverTime", hue="BusinessTravel", data=data, ax=ax[2])
<AxesSubplot:xlabel='OverTime', ylabel='count'>
data["OverTime"].replace("Yes", 1 , inplace=True)
data["OverTime"].replace("No", 0 , inplace=True)
sns.factorplot('BusinessTravel','Attrition', hue="OverTime", data=data)
plt.show()
sns.factorplot("OverTime",'Attrition', col='Department', data=data)
plt.show()
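With overtime the attrition rate is about 30%; without overtime it is about 10%, roughly a threefold difference.
The more frequent the travel, the more overtime there is, and attrition rises with both travel frequency and overtime. Within each department as well, attrition tends to increase when there is overtime.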
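The 30% versus 10% figures can be cross-checked from numbers already printed earlier in the notebook. Because `OverTime` was recoded to 0/1 in `data_talent` and `data_normal`, each printed count times its mean gives the number of overtime workers in that group. This is a sketch under the assumption that those two groups (split on `SalaryHike_range == 4`) together cover all 1470 employees.

```python
# (count, mean of OverTime) per group, copied from the printed tables earlier;
# keys are (group name, Attrition flag)
groups = {
    ("talent", 0): (256, 0.222656), ("talent", 1): (46, 0.630435),
    ("normal", 0): (977, 0.237462), ("normal", 1): (191, 0.513089),
}

# count * mean recovers how many people in each group worked overtime
ot_stay = sum(round(n * m) for (g, a), (n, m) in groups.items() if a == 0)
ot_leave = sum(round(n * m) for (g, a), (n, m) in groups.items() if a == 1)
all_stay = sum(n for (g, a), (n, m) in groups.items() if a == 0)
all_leave = sum(n for (g, a), (n, m) in groups.items() if a == 1)

rate_ot = ot_leave / (ot_leave + ot_stay)
rate_no_ot = (all_leave - ot_leave) / ((all_leave - ot_leave) + (all_stay - ot_stay))
print(f"overtime: {rate_ot:.3f}, no overtime: {rate_no_ot:.3f}, "
      f"ratio: {rate_ot / rate_no_ot:.2f}")
```

The recovered overtime head count can be compared with `data["OverTime"].value_counts()` as a sanity check on the reconstruction.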
data['PerformanceRating'].value_counts().to_frame()
# Only Excellent (3) and Outstanding (4) exist. Is there a meaningful difference?
| PerformanceRating | |
|---|---|
| 3 | 1244 |
| 4 | 226 |
data[["Attrition",'PerformanceRating']].groupby(['PerformanceRating']).mean().plot.barh()
plt.show()
# No meaningful difference.
data['TrainingTimesLastYear'].value_counts().to_frame()
| TrainingTimesLastYear | |
|---|---|
| 2 | 547 |
| 3 | 491 |
| 4 | 123 |
| 5 | 119 |
| 1 | 71 |
| 6 | 65 |
| 0 | 54 |
data[['TrainingTimesLastYear','JobLevel']].groupby(['JobLevel']).mean().plot.barh()
plt.title("Mean TrainingTimesLastYear by JobLevel")
plt.legend(loc='best')
plt.show()
data[['TrainingTimesLastYear','JobRole']].groupby(['JobRole']).mean().plot.barh()
plt.title("Mean TrainingTimesLastYear by JobRole")
plt.legend(loc='best')
plt.show()
data[['TrainingTimesLastYear','Department']].groupby(['Department']).mean().plot.barh()
plt.title("Mean TrainingTimesLastYear by Department")
plt.legend(loc='best')
plt.show()
data[['TrainingTimesLastYear','Attrition']].groupby(['TrainingTimesLastYear']).mean().plot.barh()
plt.title("Attrition rate by TrainingTimesLastYear")
plt.show()
The group with zero training sessions last year had the highest attrition rate, followed by the group with four sessions.
sns.factorplot('TrainingTimesLastYear','Attrition',col='Department',data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff723a9a6a0>
data[data['TrainingTimesLastYear']==0]['Department'].value_counts().plot.bar()
plt.show()
data[data['TrainingTimesLastYear']==4]['Department'].value_counts().plot.bar()
plt.show()
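The mean number of training sessions is nearly the same across job roles, job levels, and departments. R&D accounts for most of the 0-session and 4-session groups simply because R&D has the most employees. There does not appear to be a meaningful difference. Perhaps it is related to tenure?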
data['EnvironmentSatisfaction'].value_counts()
3 453 4 446 2 287 1 284 Name: EnvironmentSatisfaction, dtype: int64
plt.figure(figsize=(3,3))
plt.pie(data['EnvironmentSatisfaction'].value_counts(), labels=data['EnvironmentSatisfaction'].value_counts().index,autopct='%0.1f%%')
plt.show()
#1: 'Low' 2: 'Medium' 3: 'High' 4: 'Very High'
data[['Attrition','EnvironmentSatisfaction']].groupby('Attrition').mean()
| EnvironmentSatisfaction | |
|---|---|
| Attrition | |
| 0.0 | 2.771290 |
| 1.0 | 2.464135 |
data.groupby(['EnvironmentSatisfaction','Attrition'])[['Attrition']].count()
| Attrition | ||
|---|---|---|
| EnvironmentSatisfaction | Attrition | |
| 1 | 0.0 | 212 |
| 1.0 | 72 | |
| 2 | 0.0 | 244 |
| 1.0 | 43 | |
| 3 | 0.0 | 391 |
| 1.0 | 62 | |
| 4 | 0.0 | 386 |
| 1.0 | 60 |
pd.crosstab([data.EnvironmentSatisfaction],[data.Attrition],
margins=True).style.background_gradient(cmap='summer_r')
| Attrition | 0.0 | 1.0 | All |
|---|---|---|---|
| EnvironmentSatisfaction | |||
| 1 | 212 | 72 | 284 |
| 2 | 244 | 43 | 287 |
| 3 | 391 | 62 | 453 |
| 4 | 386 | 60 | 446 |
| All | 1233 | 237 | 1470 |
sns.factorplot('Attrition','EnvironmentSatisfaction',data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff70a329f40>
att_yes_e_s=data[data['Attrition']==1].groupby('EnvironmentSatisfaction').count()[['Attrition']]
att_no_e_s=data[data['Attrition']==0].groupby('EnvironmentSatisfaction').count()[['Attrition']]
display(plt.plot(att_no_e_s['Attrition']))
display(plt.plot(att_yes_e_s['Attrition']))
[<matplotlib.lines.Line2D at 0x7ff727647430>]
[<matplotlib.lines.Line2D at 0x7ff71177c1f0>]
sns.countplot(data['EnvironmentSatisfaction'], hue='Attrition', data=data,dodge = True)
<AxesSubplot:xlabel='EnvironmentSatisfaction', ylabel='count'>
data['JobSatisfaction'].value_counts()
4 459 3 442 1 289 2 280 Name: JobSatisfaction, dtype: int64
sns.factorplot('Attrition','JobSatisfaction',data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff727647e80>
sns.countplot(data['JobSatisfaction'], hue='Attrition', data=data,dodge = True)
<AxesSubplot:xlabel='JobSatisfaction', ylabel='count'>
sns.factorplot('Attrition','EnvironmentSatisfaction',data=data, hue='Gender')
<seaborn.axisgrid.FacetGrid at 0x7ff712398e80>
sns.factorplot('Attrition','JobSatisfaction',data=data, hue='Gender')
<seaborn.axisgrid.FacetGrid at 0x7ff725bce700>
data['RelationshipSatisfaction'].value_counts()
3 459 4 432 2 303 1 276 Name: RelationshipSatisfaction, dtype: int64
sns.factorplot('Attrition','RelationshipSatisfaction',data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff7259f70a0>
sns.countplot(data['RelationshipSatisfaction'], hue='Attrition', data=data,dodge = True)
<AxesSubplot:xlabel='RelationshipSatisfaction', ylabel='count'>
att_yes_r_s=data[data['Attrition']==1].groupby('RelationshipSatisfaction').count()[['Attrition']]
att_no_r_s=data[data['Attrition']==0].groupby('RelationshipSatisfaction').count()[['Attrition']]
display(plt.plot(att_no_r_s['Attrition']))
display(plt.plot(att_yes_r_s['Attrition']))
[<matplotlib.lines.Line2D at 0x7ff725e49f10>]
[<matplotlib.lines.Line2D at 0x7ff70a676220>]
att_yes_r_s
| Attrition | |
|---|---|
| RelationshipSatisfaction | |
| 1 | 57 |
| 2 | 45 |
| 3 | 71 |
| 4 | 64 |
data['WorkLifeBalance'].value_counts()
3 893 2 344 4 153 1 80 Name: WorkLifeBalance, dtype: int64
sns.factorplot('Attrition','WorkLifeBalance',data=data)
<seaborn.axisgrid.FacetGrid at 0x7ff7016b42b0>
sns.countplot(data['WorkLifeBalance'], hue='Attrition', data=data,dodge = True)
<AxesSubplot:xlabel='WorkLifeBalance', ylabel='count'>
pd.crosstab([data.WorkLifeBalance],[data.Attrition],
margins=True).style.background_gradient(cmap='summer_r')
| Attrition | 0.0 | 1.0 | All |
|---|---|---|---|
| WorkLifeBalance | |||
| 1 | 55 | 25 | 80 |
| 2 | 286 | 58 | 344 |
| 3 | 766 | 127 | 893 |
| 4 | 126 | 27 | 153 |
| All | 1233 | 237 | 1470 |
# Attrition rate (leavers / total) by WorkLifeBalance score
data[data['Attrition']==1].groupby('WorkLifeBalance').count()['Attrition']/data.groupby('WorkLifeBalance').count()['Attrition']
WorkLifeBalance 1 0.312500 2 0.168605 3 0.142217 4 0.176471 Name: Attrition, dtype: float64
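The same rates follow from the crosstab counts above by dividing each row by its row total. A minimal sketch using those counts:

```python
import pandas as pd

# Counts copied from the WorkLifeBalance x Attrition crosstab above
counts = pd.DataFrame(
    {0: [55, 286, 766, 126], 1: [25, 58, 127, 27]},
    index=pd.Index([1, 2, 3, 4], name="WorkLifeBalance"),
)

# Row-wise shares: column 1 is the attrition rate for each score
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares[1].round(6))
```

On the raw frame, `pd.crosstab(data.WorkLifeBalance, data.Attrition, normalize='index')` produces this table of shares in one call.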
# Note: .sum() adds up the scores, so each cell is head count x score
# (e.g. 286 x 2 = 572 for score 2 among stayers), not a head count.
workandlife_df=data[data['WorkLifeBalance']<5].groupby('Attrition').sum()[['WorkLifeBalance']]
workandlife_df['score 1']=data[data['WorkLifeBalance']==1].groupby('Attrition').sum()[['WorkLifeBalance']]
workandlife_df['score 2']=data[data['WorkLifeBalance']==2].groupby('Attrition').sum()[['WorkLifeBalance']]
workandlife_df['score 3']=data[data['WorkLifeBalance']==3].groupby('Attrition').sum()[['WorkLifeBalance']]
workandlife_df['score 4']=data[data['WorkLifeBalance']==4].groupby('Attrition').sum()[['WorkLifeBalance']]
workandlife_df
| WorkLifeBalance | score 1 | score 2 | score 3 | score 4 | |
|---|---|---|---|---|---|
| Attrition | |||||
| 0.0 | 3429 | 55 | 572 | 2298 | 504 |
| 1.0 | 630 | 25 | 116 | 381 | 108 |
att_yes_Lb_s=data[data['Attrition']==1].groupby('WorkLifeBalance').count()[['Attrition']]
att_no_Lb_s=data[data['Attrition']==0].groupby('WorkLifeBalance').count()[['Attrition']]
display(plt.plot(att_no_Lb_s['Attrition'])) # stayed
display(plt.plot(att_yes_Lb_s['Attrition'])) # left
[<matplotlib.lines.Line2D at 0x7ff712d36130>]
[<matplotlib.lines.Line2D at 0x7ff712d36640>]
plt.hist(data['YearsAtCompany'],bins=30)
plt.xlabel('years at company')
plt.ylabel('count')
plt.title('Years At Company')
plt.show()
f, axes = plt.subplots(1,2)
f.set_size_inches(12,5)
axes[0].hist(data[data['Attrition']==1][['YearsAtCompany']],bins=20)
axes[1].hist(data[data['Attrition']==0][['YearsAtCompany']],bins=20)
(array([140., 208., 266., 146., 145., 132., 14., 38., 30., 20., 36.,
27., 6., 4., 6., 2., 3., 6., 1., 3.]),
array([ 0. , 1.85, 3.7 , 5.55, 7.4 , 9.25, 11.1 , 12.95, 14.8 ,
16.65, 18.5 , 20.35, 22.2 , 24.05, 25.9 , 27.75, 29.6 , 31.45,
33.3 , 35.15, 37. ]),
<BarContainer object of 20 artists>)
print('Mean YearsAtCompany, left : ',data[data['Attrition']==1][['YearsAtCompany']].mean())
print('Mean YearsAtCompany, stayed : ',data[data['Attrition']==0][['YearsAtCompany']].mean())
Mean YearsAtCompany, left : YearsAtCompany 5.130802
dtype: float64
Mean YearsAtCompany, stayed : YearsAtCompany 7.369019
dtype: float64
Leavers' mean tenure is about 2 years shorter than that of employees who stayed. Most leavers quit within 5 years, which suggests that people who do not fit the company tend to leave early.
sns.violinplot(data=data, x=data['Attrition'] ,y=data['YearsAtCompany'],hue='Gender')
<AxesSubplot:xlabel='Attrition', ylabel='YearsAtCompany'>
plt.hist(data['YearsInCurrentRole'])
(array([301., 507., 140., 259., 89., 96., 32., 25., 15., 6.]), array([ 0. , 1.8, 3.6, 5.4, 7.2, 9. , 10.8, 12.6, 14.4, 16.2, 18. ]), <BarContainer object of 10 artists>)
f, axes = plt.subplots(1,2)
f.set_size_inches(12,5)
axes[0].hist(data[data['Attrition']==1][['YearsInCurrentRole']],bins=20)
axes[1].hist(data[data['Attrition']==0][['YearsInCurrentRole']],bins=20)
(array([171., 46., 304., 119., 89., 35., 35., 191., 82., 0., 61.,
27., 22., 9., 13., 10., 6., 7., 4., 2.]),
array([ 0. , 0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3, 7.2, 8.1, 9. ,
9.9, 10.8, 11.7, 12.6, 13.5, 14.4, 15.3, 16.2, 17.1, 18. ]),
<BarContainer object of 20 artists>)
print('Mean YearsInCurrentRole, left : ',data[data['Attrition']==1][['YearsInCurrentRole']].mean())
print('Mean YearsInCurrentRole, stayed : ',data[data['Attrition']==0][['YearsInCurrentRole']].mean())
Mean YearsInCurrentRole, left : YearsInCurrentRole 2.902954
dtype: float64
Mean YearsInCurrentRole, stayed : YearsInCurrentRole 4.484185
dtype: float64
sns.violinplot(data=data, x=data['Attrition'] ,y=data['YearsInCurrentRole'],hue='Gender')
<AxesSubplot:xlabel='Attrition', ylabel='YearsInCurrentRole'>
This plot does not seem very useful.
plt.hist(data['YearsSinceLastPromotion'])
(array([938., 159., 113., 45., 108., 18., 23., 24., 20., 22.]), array([ 0. , 1.5, 3. , 4.5, 6. , 7.5, 9. , 10.5, 12. , 13.5, 15. ]), <BarContainer object of 10 artists>)
f, axes = plt.subplots(1,2)
f.set_size_inches(12,5)
axes[0].hist(data[data['Attrition']==1][['YearsSinceLastPromotion']],bins=15)
axes[1].hist(data[data['Attrition']==0][['YearsSinceLastPromotion']],bins=15)
(array([471., 308., 132., 43., 56., 43., 26., 60., 18., 13., 5.,
22., 10., 8., 18.]),
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.,
13., 14., 15.]),
<BarContainer object of 15 artists>)
print('Mean YearsSinceLastPromotion, left : ',data[data['Attrition']==1][['YearsSinceLastPromotion']].mean())
print('Mean YearsSinceLastPromotion, stayed : ',data[data['Attrition']==0][['YearsSinceLastPromotion']].mean())
Mean YearsSinceLastPromotion, left : YearsSinceLastPromotion 1.945148
dtype: float64
Mean YearsSinceLastPromotion, stayed : YearsSinceLastPromotion 2.234388
dtype: float64
The time since the last promotion does not appear to be strongly related to attrition.
plt.hist(data['YearsWithCurrManager'])
(array([339., 486., 129., 29., 323., 91., 22., 32., 10., 9.]), array([ 0. , 1.7, 3.4, 5.1, 6.8, 8.5, 10.2, 11.9, 13.6, 15.3, 17. ]), <BarContainer object of 10 artists>)
f, axes = plt.subplots(1,2)
f.set_size_inches(12,5)
axes[0].hist(data[data['Attrition']==1][['YearsWithCurrManager']],bins=15)
axes[1].hist(data[data['Attrition']==0][['YearsWithCurrManager']],bins=15)
(array([243., 294., 123., 87., 27., 25., 185., 155., 24., 21., 18.,
14., 3., 5., 9.]),
array([ 0. , 1.13333333, 2.26666667, 3.4 , 4.53333333,
5.66666667, 6.8 , 7.93333333, 9.06666667, 10.2 ,
11.33333333, 12.46666667, 13.6 , 14.73333333, 15.86666667,
17. ]),
<BarContainer object of 15 artists>)
print('Mean YearsWithCurrManager, left : ',data[data['Attrition']==1][['YearsWithCurrManager']].mean())
print('Mean YearsWithCurrManager, stayed : ',data[data['Attrition']==0][['YearsWithCurrManager']].mean())
Mean YearsWithCurrManager, left : YearsWithCurrManager 2.852321
dtype: float64
Mean YearsWithCurrManager, stayed : YearsWithCurrManager 4.367397
dtype: float64
In the heatmap below, YearsWithCurrManager shows a relatively high correlation of 0.62 with YearsAtCompany.
years_df=data[['Attrition','YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']]
sns.heatmap(years_df.corr(),annot=True)
<AxesSubplot:>
If a value of 0.7 or higher is taken to indicate a clear positive correlation, the three items YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager may be exhibiting multicollinearity.
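Pairwise correlations above 0.7 are a hint of multicollinearity rather than a test. A common follow-up is the variance inflation factor (VIF), which for standardized predictors equals the diagonal of the inverse correlation matrix. The sketch below uses synthetic columns, not the HR data, and the often-quoted cutoffs of 5 or 10 are conventions rather than rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the HR data): b is built to track a closely, c is independent
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)
c = rng.normal(size=500)
X = np.column_stack([a, b, c])

# For standardized predictors, the VIFs are the diagonal of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))
print(dict(zip(["a", "b", "c"], vif.round(2))))
```

For the notebook, the same two lines could be applied to `years_df.drop(columns='Attrition').corr()`, so that only the predictors enter the matrix.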
data_talent = data[(data['SalaryHike_range']==4) & (data['Age']<=33)]
data_normal = data[~(data['SalaryHike_range']==4) | ~(data['Age']<=33)]
data_mz = data[(data['Age']<34)]
data_not_talent_mz= data[~(data['SalaryHike_range']==4) & (data['Age']<34)]
data_talent
| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeNumber | EnvironmentSatisfaction | ... | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | BeforeWorkingYears | Age_band | BeforeWorkingYears_band | Rate_range | Income_range | SalaryHike_range | HomeDistance_range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 30 | 0.0 | Travel_Rarely | 1358 | Research & Development | 24 | 1 | Life Sciences | 11 | 4 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 4 | 4 |
| 26 | 32 | 1.0 | Travel_Frequently | 1125 | Research & Development | 16 | 1 | Life Sciences | 33 | 2 | ... | 2 | 6 | 7 | 0 | 1 | 0 | 1 | 2 | 4 | 4 |
| 39 | 33 | 0.0 | Travel_Frequently | 1141 | Sales | 1 | 3 | Life Sciences | 52 | 3 | ... | 3 | 1 | 3 | 5 | 1 | 0 | 1 | 3 | 4 | 1 |
| 44 | 30 | 0.0 | Travel_Frequently | 721 | Research & Development | 1 | 2 | Medical | 57 | 3 | ... | 8 | 3 | 7 | 0 | 1 | 0 | 2 | 2 | 4 | 1 |
| 54 | 26 | 0.0 | Travel_Rarely | 1443 | Sales | 23 | 3 | Marketing | 72 | 3 | ... | 2 | 0 | 0 | 3 | 1 | 0 | 4 | 2 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1431 | 32 | 0.0 | Travel_Rarely | 801 | Sales | 1 | 4 | Marketing | 2016 | 3 | ... | 10 | 5 | 7 | 0 | 1 | 0 | 4 | 4 | 4 | 1 |
| 1433 | 25 | 0.0 | Travel_Rarely | 1382 | Sales | 8 | 2 | Other | 2018 | 1 | ... | 3 | 0 | 4 | 1 | 0 | 0 | 2 | 2 | 4 | 3 |
| 1438 | 23 | 1.0 | Travel_Frequently | 638 | Sales | 9 | 3 | Marketing | 2023 | 4 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 1 | 4 | 3 |
| 1463 | 31 | 0.0 | Non-Travel | 325 | Research & Development | 5 | 3 | Medical | 2057 | 2 | ... | 4 | 1 | 7 | 1 | 1 | 0 | 1 | 4 | 4 | 2 |
| 1467 | 27 | 0.0 | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | 2064 | 2 | ... | 2 | 0 | 3 | 0 | 1 | 0 | 1 | 3 | 4 | 2 |
125 rows × 39 columns
print("MZ core talent mean age:", data_talent['Age'].mean(), "\nOther employees mean age:", data_normal['Age'].mean())
MZ core talent mean age: 28.112
Other employees mean age: 37.74275092936803
# Attrition is coded 0/1, so the column mean is the attrition rate
print('MZ core talent attrition rate:', data_talent['Attrition'].mean())
print('Other employees attrition rate:', data_normal['Attrition'].mean())
print('Overall attrition rate:', data['Attrition'].mean())
MZ core talent attrition rate: 0.216
Other employees attrition rate: 0.15613382899628253
Overall attrition rate: 0.16122448979591836
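Because the core-talent group is small (125 people), it can help to attach an uncertainty range to each rate before comparing them. The sketch below computes a 95% Wilson score interval using only the standard library. The count 27 of 125 comes from the crosstab in this notebook; the count 210 of 1345 is an assumption inferred from the printed rate 0.1561, not a figure shown directly.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

talent_ci = wilson_interval(27, 125)     # core talent: 27 leavers of 125
others_ci = wilson_interval(210, 1345)   # others: inferred counts (assumption)

print("core talent:", talent_ci)
print("others:     ", others_ci)
```

If the two intervals overlap, the gap between 21.6% and 15.6% should be read with caution rather than as a settled difference.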
print("MZ core talent mean monthly income: %.2f \t\t Other employees mean monthly income: %.2f" %(data_talent['MonthlyIncome'].mean(),data_normal['MonthlyIncome'].mean()))
print("All MZ employees mean monthly income: %.2f \t\t Non-talent MZ mean monthly income: %.2f" %(data_mz['MonthlyIncome'].mean(),data_not_talent_mz['MonthlyIncome'].mean()))
print('All employees mean monthly income: %.2f' %(data['MonthlyIncome'].mean()))
MZ core talent mean monthly income: 4570.37 		 Other employees mean monthly income: 6682.54
All MZ employees mean monthly income: 4361.53 		 Non-talent MZ mean monthly income: 4303.39
All employees mean monthly income: 6502.93
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(data=data_talent, x='Gender', ax=ax[0])
ax[0].set_title('MZ Talent Group')
sns.countplot(data=data_normal, x='Gender', ax=ax[1])
ax[1].set_title('Normal Group')
for a in ax:
for p in a.patches:
a.annotate(format(p.get_height(),".0f"), (p.get_x()+p.get_width()/2.0, p.get_height()),
ha='center', va='center', size=17, xytext=(0,7), textcoords='offset points')
plt.show()
pd.crosstab(data_talent.Attrition,data_talent.Gender, margins=True).style.background_gradient(cmap='summer_r')
| Gender | Female | Male | All |
|---|---|---|---|
| Attrition | |||
| 0.0 | 41 | 57 | 98 |
| 1.0 | 13 | 14 | 27 |
| All | 54 | 71 | 125 |
plt.figure(figsize=(10,6))
sns.lineplot(x='Gender', y='Attrition', data=data_talent, label='MZ Talent Group')
sns.lineplot(x='Gender', y='Attrition', data=data_normal, label='Normal Group')
# Overall attrition rate per gender, drawn as a reference line
sns.lineplot(x='Gender', y='Attrition', data=data.groupby('Gender')['Attrition'].mean().reset_index(), color='r', alpha=0.5, label='All')
plt.legend()
plt.show()
In the core-talent group, women have the higher attrition rate (13 of 54 women versus 14 of 71 men).
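The comparison above rests on rates rather than raw counts. As a minimal sketch, the counts from the Attrition by Gender crosstab can be turned into per-gender rates by dividing each column by its total. The `counts` frame below is rebuilt by hand from that crosstab; on the raw data, `pd.crosstab(data_talent.Attrition, data_talent.Gender, normalize='columns')` should give the same proportions.

```python
import pandas as pd

# Counts copied from the Attrition x Gender crosstab of the core-talent group
counts = pd.DataFrame(
    {"Female": [41, 13], "Male": [57, 14]},
    index=pd.Index([0, 1], name="Attrition"),
)

# Divide each column by its total to get the share of stayers and leavers
rates = counts.div(counts.sum(axis=0), axis=1)
female_rate = rates.loc[1, "Female"]
male_rate = rates.loc[1, "Male"]
print(rates.round(3))
```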
Education levels: 1 = 'Below College', 2 = 'College', 3 = 'Bachelor', 4 = 'Master', 5 = 'Doctor'
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(data=data_talent, x='Education', ax=ax[0])
ax[0].set_title('MZ Talent Group')
sns.countplot(data=data_normal, x='Education', ax=ax[1])
ax[1].set_title('Normal Group')
for a in ax:
for p in a.patches:
a.annotate(format(p.get_height(),".0f"), (p.get_x()+p.get_width()/2.0, p.get_height()),
ha='center', va='center', size=17, xytext=(0,7), textcoords='offset points')
plt.show()
pd.crosstab(data_talent.Attrition,data_talent.Education, margins=True).style.background_gradient(cmap='summer_r')
| Education | 1 | 2 | 3 | 4 | 5 | All |
|---|---|---|---|---|---|---|
| Attrition | ||||||
| 0.0 | 18 | 17 | 45 | 14 | 4 | 98 |
| 1.0 | 7 | 4 | 14 | 2 | 0 | 27 |
| All | 25 | 21 | 59 | 16 | 4 | 125 |
plt.figure(figsize=(10,6))
sns.lineplot(x='Education', y='Attrition', data=data_talent, label='MZ Talent Group')
sns.lineplot(x='Education', y='Attrition', data=data_normal, label='Normal Group')
# Overall attrition rate per education level, drawn as a reference line
sns.lineplot(x='Education', y='Attrition', data=data.groupby('Education')['Attrition'].mean().reset_index(), color='r', alpha=0.5, label='All')
plt.legend()
plt.show()
pd.crosstab(data_talent.Gender,data_talent.Education, margins=True).style.background_gradient(cmap='summer_r')
| Education | 1 | 2 | 3 | 4 | 5 | All |
|---|---|---|---|---|---|---|
| Gender | ||||||
| Female | 11 | 7 | 27 | 8 | 1 | 54 |
| Male | 14 | 14 | 32 | 8 | 3 | 71 |
| All | 25 | 21 | 59 | 16 | 4 | 125 |
data_talent.groupby(['Attrition', 'Gender', 'Education'])[['Education']].count()
| Education | |||
|---|---|---|---|
| Attrition | Gender | Education | |
| 0.0 | Female | 1 | 7 |
| 2 | 5 | ||
| 3 | 21 | ||
| 4 | 7 | ||
| 5 | 1 | ||
| Male | 1 | 11 | |
| 2 | 12 | ||
| 3 | 24 | ||
| 4 | 7 | ||
| 5 | 3 | ||
| 1.0 | Female | 1 | 4 |
| 2 | 2 | ||
| 3 | 6 | ||
| 4 | 1 | ||
| Male | 1 | 3 | |
| 2 | 2 | ||
| 3 | 8 | ||
| 4 | 1 |
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='Education', y='Attrition', data=data_talent, hue='Gender', ax=ax[0])
ax[0].legend()
ax[0].set_title('MZ Talent Group')
sns.lineplot(x='Education', y='Attrition', data=data_normal, hue='Gender', ax=ax[1])
ax[1].legend()
ax[1].set_title('Normal Group')
plt.show()
Women with a college education or below (Education 1 or 2) have a relatively high attrition rate.
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(data=data_talent, x='EducationField', ax=ax[0])
ax[0].set_title('MZ Talent Group')
ax[0].set_xticklabels(ax[0].get_xticklabels(), rotation=45)
sns.countplot(data=data_normal, x='EducationField', ax=ax[1])
ax[1].set_title('Normal Group')
ax[1].set_xticklabels(ax[1].get_xticklabels(),rotation=45)
for a in ax:
for p in a.patches:
a.annotate(format(p.get_height(),".0f"), (p.get_x()+p.get_width()/2.0, p.get_height()),
ha='center', va='center', size=17, xytext=(0,7), textcoords='offset points')
plt.show()
pd.crosstab(data_talent.Attrition,data_talent.EducationField, margins=True).style.background_gradient(cmap='summer_r')
| EducationField | Human Resources | Life Sciences | Marketing | Medical | Other | Technical Degree | All |
|---|---|---|---|---|---|---|---|
| Attrition | |||||||
| 0.0 | 1 | 48 | 6 | 34 | 6 | 3 | 98 |
| 1.0 | 0 | 12 | 3 | 7 | 1 | 4 | 27 |
| All | 1 | 60 | 9 | 41 | 7 | 7 | 125 |
plt.figure(figsize=(10,6))
sns.lineplot(x='EducationField', y='Attrition', data=data_talent, label='MZ Talent Group')
sns.lineplot(x='EducationField', y='Attrition', data=data_normal, label='Normal Group')
# Overall attrition rate per education field, drawn as a reference line
sns.lineplot(x='EducationField', y='Attrition', data=data.groupby('EducationField')['Attrition'].mean().reset_index(), color='r', alpha=0.2, label='All')
plt.legend()
plt.show()
Only employees who studied Human Resources show a low attrition rate, but that figure is based on a single person, so it is not meaningful.
data_talent.groupby(['Attrition', 'Gender', 'EducationField'])[['EducationField']].count()
| EducationField | |||
|---|---|---|---|
| Attrition | Gender | EducationField | |
| 0.0 | Female | Life Sciences | 16 |
| Marketing | 3 | ||
| Medical | 17 | ||
| Other | 4 | ||
| Technical Degree | 1 | ||
| Male | Human Resources | 1 | |
| Life Sciences | 32 | ||
| Marketing | 3 | ||
| Medical | 17 | ||
| Other | 2 | ||
| Technical Degree | 2 | ||
| 1.0 | Female | Life Sciences | 6 |
| Marketing | 1 | ||
| Medical | 4 | ||
| Other | 1 | ||
| Technical Degree | 1 | ||
| Male | Life Sciences | 6 | |
| Marketing | 2 | ||
| Medical | 3 | ||
| Technical Degree | 3 |
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='EducationField', y='Attrition', data=data_talent, hue='Gender', ax=ax[0])
ax[0].legend()
ax[0].set_title('MZ Talent Group')
sns.lineplot(x='EducationField', y='Attrition', data=data_normal, hue='Gender', ax=ax[1])
ax[1].legend()
ax[1].set_title('Normal Group')
plt.show()
The headcount per cell is small, so this is not very conclusive, but the overall tendency for women to leave more often shows up here as well.
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(data=data_talent, x='MaritalStatus', ax=ax[0])
ax[0].set_title('MZ Talent Group')
sns.countplot(data=data_normal, x='MaritalStatus', ax=ax[1])
ax[1].set_title('Normal Group')
for a in ax:
for p in a.patches:
a.annotate(format(p.get_height(),".0f"), (p.get_x()+p.get_width()/2.0, p.get_height()),
ha='center', va='center', size=17, xytext=(0,7), textcoords='offset points')
plt.show()
pd.crosstab(data_talent.Attrition,data_talent.MaritalStatus, margins=True).style.background_gradient(cmap='summer_r')
| MaritalStatus | Divorced | Married | Single | All |
|---|---|---|---|---|
| Attrition | ||||
| 0.0 | 25 | 48 | 25 | 98 |
| 1.0 | 3 | 5 | 19 | 27 |
| All | 28 | 53 | 44 | 125 |
plt.figure(figsize=(10,6))
sns.lineplot(x='MaritalStatus', y='Attrition', data=data_talent, label='MZ Talent Group')
sns.lineplot(x='MaritalStatus', y='Attrition', data=data_normal, label='Normal Group')
# Overall attrition rate per marital status, drawn as a reference line
sns.lineplot(x='MaritalStatus', y='Attrition', data=data.groupby('MaritalStatus')['Attrition'].mean().reset_index(), color='r', alpha=0.3, label='All')
plt.legend()
plt.show()
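The high attrition rate among young single employees (19 of 44 in the core-talent group) looks like a real problem. Why they leave needs further investigation.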
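One way to check whether the marital-status gap is more than noise is a chi-square test of independence on the crosstab counts. The sketch below computes only the test statistic in plain Python; `chi_square_statistic` is a hypothetical helper, the counts are copied from the Attrition by MaritalStatus crosstab above, and 5.991 is the standard 5% critical value for 2 degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: Attrition 0 / 1. Columns: Divorced, Married, Single (core-talent group).
observed = [[25, 48, 25],
            [3, 5, 19]]
chi2 = chi_square_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_5PCT_DOF2 = 5.991  # chi-square critical value, alpha = 0.05, dof = 2

print("chi2 =", round(chi2, 2), "dof =", dof)
print("exceeds 5% critical value:", chi2 > CRITICAL_5PCT_DOF2)
```

This only says that attrition and marital status are unlikely to be independent in this group. It does not say why single employees leave, and age or job level may be confounded with marital status.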
data_talent.groupby(['Attrition', 'Gender', 'MaritalStatus'])[['MaritalStatus']].count()
| MaritalStatus | |||
|---|---|---|---|
| Attrition | Gender | MaritalStatus | |
| 0.0 | Female | Divorced | 10 |
| Married | 19 | ||
| Single | 12 | ||
| Male | Divorced | 15 | |
| Married | 29 | ||
| Single | 13 | ||
| 1.0 | Female | Divorced | 1 |
| Married | 3 | ||
| Single | 9 | ||
| Male | Divorced | 2 | |
| Married | 2 | ||
| Single | 10 |
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='MaritalStatus', y='Attrition', data=data_talent, hue='Gender', ax=ax[0])
sns.lineplot(x='MaritalStatus', y='Attrition', data=data.groupby('MaritalStatus')['Attrition'].mean().reset_index(), color='r', alpha=0.3, label='All', ax=ax[0])
ax[0].legend()
ax[0].set_title('MZ Talent Group')
sns.lineplot(x='MaritalStatus', y='Attrition', data=data_normal, hue='Gender', ax=ax[1])
sns.lineplot(x='MaritalStatus', y='Attrition', data=data.groupby('MaritalStatus')['Attrition'].mean().reset_index(), color='r', alpha=0.3, label='All', ax=ax[1])
ax[1].legend()
ax[1].set_title('Normal Group')
plt.show()
Combining gender with marital status shows no large difference.
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.countplot(data=data_talent, x='NumCompaniesWorked', ax=ax[0])
ax[0].set_title('MZ Talent Group')
sns.countplot(data=data_normal, x='NumCompaniesWorked', ax=ax[1])
ax[1].set_title('Normal Group')
for a in ax:
for p in a.patches:
a.annotate(format(p.get_height(),".0f"), (p.get_x()+p.get_width()/2.0, p.get_height()),
ha='center', va='center', size=17, xytext=(0,7), textcoords='offset points')
plt.show()
pd.crosstab(data_talent.Attrition,data_talent.NumCompaniesWorked, margins=True).style.background_gradient(cmap='summer_r')
| NumCompaniesWorked | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition | |||||||||||
| 0.0 | 16 | 52 | 5 | 6 | 4 | 2 | 5 | 2 | 5 | 1 | 98 |
| 1.0 | 4 | 16 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | 1 | 27 |
| All | 20 | 68 | 6 | 6 | 5 | 2 | 7 | 4 | 5 | 2 | 125 |
plt.figure(figsize=(10,6))
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data_talent, label='MZ Talent Group')
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data_normal, label='Normal Group')
# Overall attrition rate per number of previous companies, drawn as a reference line
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data.groupby('NumCompaniesWorked')['Attrition'].mean().reset_index(), color='r', alpha=0.3, label='All')
plt.legend()
plt.show()
The attrition rate rises when the number of previous companies is 6, 7, or 9, but the headcount in those groups is too small to draw a firm conclusion.
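The small-sample caveat can be made explicit by reporting each group's size next to its rate and flagging groups below a minimum headcount. The sketch below is illustrative: `attrition_with_counts` and the `min_n=10` cutoff are assumptions, and `demo` is rebuilt from the NumCompaniesWorked crosstab above rather than read from the raw data.

```python
import pandas as pd

def attrition_with_counts(df, by, min_n=10):
    """Attrition rate and group size per category, with a small-sample flag."""
    out = df.groupby(by)["Attrition"].agg(rate="mean", n="count")
    out["reliable"] = out["n"] >= min_n  # min_n is an arbitrary cutoff (assumption)
    return out

# Counts copied from the Attrition x NumCompaniesWorked crosstab (core talent)
stayed = {0: 16, 1: 52, 2: 5, 3: 6, 4: 4, 5: 2, 6: 5, 7: 2, 8: 5, 9: 1}
left = {0: 4, 1: 16, 2: 1, 3: 0, 4: 1, 5: 0, 6: 2, 7: 2, 8: 0, 9: 1}

rows = []
for k in stayed:
    rows += [{"NumCompaniesWorked": k, "Attrition": 0}] * stayed[k]
    rows += [{"NumCompaniesWorked": k, "Attrition": 1}] * left[k]
demo = pd.DataFrame(rows)

summary = attrition_with_counts(demo, "NumCompaniesWorked")
print(summary)
```

On the real data the call would be `attrition_with_counts(data_talent, 'NumCompaniesWorked')`. Rows with `reliable == False` are the ones where a single resignation can swing the rate sharply.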
data_talent.groupby(['Attrition', 'Gender', 'NumCompaniesWorked'])[['NumCompaniesWorked']].count()
| NumCompaniesWorked | |||
|---|---|---|---|
| Attrition | Gender | NumCompaniesWorked | |
| 0.0 | Female | 0 | 5 |
| 1 | 24 | ||
| 2 | 3 | ||
| 4 | 3 | ||
| 5 | 1 | ||
| 6 | 2 | ||
| 7 | 1 | ||
| 8 | 2 | ||
| Male | 0 | 11 | |
| 1 | 28 | ||
| 2 | 2 | ||
| 3 | 6 | ||
| 4 | 1 | ||
| 5 | 1 | ||
| 6 | 3 | ||
| 7 | 1 | ||
| 8 | 3 | ||
| 9 | 1 | ||
| 1.0 | Female | 0 | 2 |
| 1 | 6 | ||
| 2 | 1 | ||
| 6 | 2 | ||
| 7 | 2 | ||
| Male | 0 | 2 | |
| 1 | 10 | ||
| 4 | 1 | ||
| 9 | 1 |
f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data_talent, hue='Gender', ax=ax[0])
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data.groupby('NumCompaniesWorked')['Attrition'].mean().reset_index(), color='r', alpha=0.3, label='All', ax=ax[0])
ax[0].legend()
ax[0].set_title('MZ Talent Group')
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data_normal, hue='Gender', ax=ax[1])
sns.lineplot(x='NumCompaniesWorked', y='Attrition', data=data.groupby('NumCompaniesWorked')['Attrition'].mean().reset_index(), color='r', alpha=0.3, label='All', ax=ax[1])
ax[1].legend()
ax[1].set_title('Normal Group')
plt.show()
f,ax=plt.subplots(1,3,figsize=(30,10))
data_talent['Department'].value_counts().plot.bar(ax=ax[0])
print(data_talent['Department'].value_counts().to_frame())
data_talent[['Attrition','Department']].groupby(['Department']).mean().plot.barh(ax=ax[1])
print(data_talent[['Attrition','Department']].groupby(['Department']).mean())
data_talent['Attrition'].value_counts().plot.pie(autopct='%1.1f%%',
ax=ax[2],
fontsize=20)
Department
Research & Development 87
Sales 35
Human Resources 3
Attrition
Department
Human Resources 0.000000
Research & Development 0.229885
Sales 0.200000
<AxesSubplot:ylabel='Attrition'>
f,ax=plt.subplots(1,3,figsize=(30,10))
data_talent['JobLevel'].value_counts().plot.bar(ax=ax[0])
print(data_talent['JobLevel'].value_counts().to_frame())
data_talent[['JobLevel','Department']].groupby(['Department']).mean().plot.barh(ax=ax[1])
print(data_talent[['JobLevel','Department']].groupby(['Department']).mean())
data_talent.groupby(['Department'])['JobLevel'].value_counts().plot.barh(ax=ax[2])
data_talent.groupby(['Department'])['JobLevel'].value_counts().to_frame()
#sns.countplot(x="JobLevel", hue='Department', data=data_talent, ax=ax[2])
JobLevel
1 67
2 45
3 12
4 1
JobLevel
Department
Human Resources 1.333333
Research & Development 1.482759
Sales 1.828571
| JobLevel | ||
|---|---|---|
| Department | JobLevel | |
| Human Resources | 1 | 2 |
| 2 | 1 | |
| Research & Development | 1 | 54 |
| 2 | 25 | |
| 3 | 7 | |
| 4 | 1 | |
| Sales | 2 | 19 |
| 1 | 11 | |
| 3 | 5 |
# factorplot was renamed to catplot; kind='point' keeps the original point plot
sns.catplot(x='JobLevel', y='Attrition', data=data_talent, kind='point')
plt.show()
data_talent.groupby(['JobLevel'])['Attrition'].mean().to_frame()
| Attrition | |
|---|---|
| JobLevel | |
| 1 | 0.343284 |
| 2 | 0.044444 |
| 3 | 0.166667 |
| 4 | 0.000000 |
data_talent.groupby(['JobRole','Department'])['Department'].count().to_frame()
| Department | ||
|---|---|---|
| JobRole | Department | |
| Healthcare Representative | Research & Development | 8 |
| Human Resources | Human Resources | 3 |
| Laboratory Technician | Research & Development | 21 |
| Manager | Research & Development | 2 |
| Sales | 1 | |
| Manufacturing Director | Research & Development | 17 |
| Research Director | Research & Development | 2 |
| Research Scientist | Research & Development | 37 |
| Sales Executive | Sales | 22 |
| Sales Representative | Sales | 12 |
f, ax = plt.subplots(1,3,figsize=(30,10))
data_talent[data_talent['Department']=="Research & Development"]['JobRole'].value_counts().plot.barh(ax=ax[0])
ax[0].set_title("Research & Development's JobRole")
data_talent[data_talent['Department']=="Sales"]['JobRole'].value_counts().plot.barh(ax=ax[1])
ax[1].set_title("Sales's JobRole")
data_talent[data_talent['Department']=="Human Resources"]['JobRole'].value_counts().plot.barh(ax=ax[2])
ax[2].set_title("Human Resources's JobRole")
print(data_talent[data_talent['Department']=="Research & Development"]['JobRole'].value_counts())
print("")
print(data_talent[data_talent['Department']=="Sales"]['JobRole'].value_counts())
print("")
print(data_talent[data_talent['Department']=="Human Resources"]['JobRole'].value_counts())
print("")
Research Scientist           37
Laboratory Technician        21
Manufacturing Director       17
Healthcare Representative     8
Manager                       2
Research Director             2
Name: JobRole, dtype: int64

Sales Executive         22
Sales Representative    12
Manager                  1
Name: JobRole, dtype: int64

Human Resources    3
Name: JobRole, dtype: int64
data_talent[['Attrition','JobRole']].groupby(['JobRole']).mean().plot.barh()
plt.title("Attrition by JobRole")
plt.yticks(size=10)
plt.show()
print(data_talent[['Attrition','JobRole']].groupby(['JobRole']).mean())
                           Attrition
JobRole
Healthcare Representative   0.000000
Human Resources             0.000000
Laboratory Technician       0.380952
Manager                     0.000000
Manufacturing Director      0.058824
Research Director           0.000000
Research Scientist          0.297297
Sales Executive             0.090909
Sales Representative        0.416667
Sales= [Sales Executive(2,3,4) : 22, Sales Representative(1,2) : 12, Manager(3,4,5): 1 ]
Research & Development = [Research Scientist(1,2,3) : 37, Laboratory Technician(1,2,3) : 21,
Manufacturing Director(2,3,4) : 17 ,Healthcare Representative(2,3,4) : 8,
Research Director(3,4,5) : 2, Manager(3,4,5) : 2 ]
Human Resources = [Human Resources(1,2,3) : 3 ]
data_talent['BeforeWorkingYears'].value_counts().plot.barh()
plt.show()
data_talent['BeforeWorkingYears'].value_counts().to_frame()
| BeforeWorkingYears | |
|---|---|
| 0 | 62 |
| 1 | 26 |
| 2 | 11 |
| 4 | 8 |
| 3 | 6 |
| 5 | 6 |
| 6 | 2 |
| 7 | 1 |
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
# factorplot was renamed to catplot; kind='point' keeps the original point plot
sns.catplot(x='BeforeWorkingYears', y='Attrition', data=data_talent, kind='point')
plt.show()
data_talent.groupby(['BeforeWorkingYears'])['Attrition'].mean().to_frame()
| Attrition | |
|---|---|
| BeforeWorkingYears | |
| 0 | 0.225806 |
| 1 | 0.230769 |
| 2 | 0.181818 |
| 3 | 0.333333 |
| 4 | 0.125000 |
| 5 | 0.166667 |
| 6 | 0.000000 |
| 7 | 0.000000 |
| 10 | 0.000000 |
| 11 | 1.000000 |
| 12 | 0.000000 |
data_talent.groupby(['BeforeWorkingYears', 'JobRole'])['Attrition'].mean().to_frame()
| Attrition | ||
|---|---|---|
| BeforeWorkingYears | JobRole | |
| 0 | Healthcare Representative | 0.000000 |
| Human Resources | 0.000000 | |
| Laboratory Technician | 0.384615 | |
| Manager | 0.000000 | |
| Manufacturing Director | 0.000000 | |
| Research Scientist | 0.238095 | |
| Sales Executive | 0.142857 | |
| Sales Representative | 0.375000 | |
| 1 | Healthcare Representative | 0.000000 |
| Human Resources | 0.000000 | |
| Laboratory Technician | 1.000000 | |
| Manufacturing Director | 0.250000 | |
| Research Scientist | 0.428571 | |
| Sales Executive | 0.000000 | |
| Sales Representative | 0.333333 | |
| 2 | Healthcare Representative | 0.000000 |
| Laboratory Technician | 0.000000 | |
| Manufacturing Director | 0.000000 | |
| Research Director | 0.000000 | |
| Research Scientist | 0.500000 | |
| Sales Executive | 0.000000 | |
| Sales Representative | 1.000000 | |
| 3 | Laboratory Technician | 0.000000 |
| Research Scientist | 0.500000 | |
| Sales Executive | 0.000000 | |
| 4 | Laboratory Technician | 1.000000 |
| Manufacturing Director | 0.000000 | |
| Research Director | 0.000000 | |
| Research Scientist | 0.000000 | |
| Sales Executive | 0.000000 | |
| 5 | Laboratory Technician | 0.333333 |
| Manager | 0.000000 | |
| Sales Executive | 0.000000 | |
| 6 | Manufacturing Director | 0.000000 |
| Sales Executive | 0.000000 | |
| 7 | Manufacturing Director | 0.000000 |
| 10 | Sales Executive | 0.000000 |
| 11 | Sales Executive | 1.000000 |
| 12 | Manufacturing Director | 0.000000 |
sns.catplot(x='BeforeWorkingYears', y='JobRole', col='Attrition', data=data_talent, kind='point')
plt.show()
data_talent['YearsAtCompany'].value_counts().to_frame()
| YearsAtCompany | |
|---|---|
| 1 | 20 |
| 5 | 17 |
| 2 | 13 |
| 3 | 12 |
| 6 | 11 |
| 9 | 9 |
| 10 | 9 |
| 0 | 7 |
| 7 | 6 |
| 8 | 6 |
| 11 | 6 |
| 4 | 4 |
| 13 | 3 |
| 12 | 1 |
| 14 | 1 |
data_talent[['Attrition','YearsAtCompany']].groupby(['YearsAtCompany']).mean().plot.barh()
plt.yticks(size=15)
plt.show()
print(data_talent[['Attrition','YearsAtCompany']].groupby(['YearsAtCompany']).mean())
                Attrition
YearsAtCompany
0                0.714286
1                0.500000
2                0.153846
3                0.333333
4                0.000000
5                0.058824
6                0.090909
7                0.000000
8                0.000000
9                0.111111
10               0.222222
11               0.000000
12               0.000000
13               0.333333
14               0.000000
sns.catplot(x='YearsAtCompany', y='Attrition', col='JobRole', data=data_talent, kind='point')
plt.show()
# Bin tenure into four bands: up to 3 years (1), 4-6 (2), 7-10 (3), more than 10 (4)
data_talent['YearsAtCom']=1
data_talent.loc[data_talent['YearsAtCompany'] <= 3, 'YearsAtCom'] = 1
data_talent.loc[(data_talent['YearsAtCompany'] > 3) & (data_talent['YearsAtCompany'] <= 6), 'YearsAtCom'] = 2
data_talent.loc[(data_talent['YearsAtCompany'] > 6) & (data_talent['YearsAtCompany'] <= 10), 'YearsAtCom'] = 3
data_talent.loc[data_talent['YearsAtCompany'] > 10, 'YearsAtCom'] = 4
data_talent['YearsAtCom']
7       1
26      3
39      2
44      4
54      1
       ..
1431    4
1433    2
1438    1
1463    3
1467    2
Name: YearsAtCom, Length: 125, dtype: int64
data_talent[["Attrition",'YearsAtCom']].groupby(['YearsAtCom']).mean().plot.bar()
plt.show()
print(data_talent[["Attrition",'YearsAtCom']].groupby(['YearsAtCom']).mean())
            Attrition
YearsAtCom
1            0.403846
2            0.062500
3            0.100000
4            0.090909
sns.catplot(x='YearsAtCom', y='Attrition', col='JobRole', data=data_talent, kind='point')
plt.show()
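The tenure bands (up to 3 years, 4 to 6, 7 to 10, more than 10) can also be built in a single call with `pd.cut`, which avoids repeating the boundary conditions by hand. This is a sketch on a small made-up Series, not the notebook's own method; the bin edges are chosen to mirror the bands above.

```python
import pandas as pd

# Made-up tenure values covering every band edge (illustrative only)
years = pd.Series([0, 3, 4, 6, 7, 10, 11, 14], name="YearsAtCompany")

# Right-closed bins: (-1, 3], (3, 6], (6, 10], (10, inf)
bands = pd.cut(
    years,
    bins=[-1, 3, 6, 10, float("inf")],
    labels=[1, 2, 3, 4],
).astype(int)

print(pd.DataFrame({"YearsAtCompany": years, "YearsAtCom": bands}))
```

Applied to the notebook, this would read `data_talent['YearsAtCom'] = pd.cut(data_talent['YearsAtCompany'], ...)` with the same `bins` and `labels`.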
sns.catplot(x='YearsAtCompany', y='Attrition', col='BeforeWorkingYears', data=data_talent, kind='point')
plt.show()
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='EnvironmentSatisfaction', data=data_talent, ax=ax[0])
sns.pointplot(x='Attrition', y='EnvironmentSatisfaction', data=data_normal, ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='EnvironmentSatisfaction'>
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='EnvironmentSatisfaction', data=data_talent, hue='Gender', ax=ax[0])
sns.pointplot(x='Attrition', y='EnvironmentSatisfaction', data=data_normal, hue='Gender', ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='EnvironmentSatisfaction'>
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='JobSatisfaction', data=data_talent, ax=ax[0])
sns.pointplot(x='Attrition', y='JobSatisfaction', data=data_normal, ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='JobSatisfaction'>
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='JobSatisfaction', data=data_talent, hue='Gender', ax=ax[0])
sns.pointplot(x='Attrition', y='JobSatisfaction', data=data_normal, hue='Gender', ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='JobSatisfaction'>
print('Core talent, left: ',data_talent[data_talent['Attrition']==1]['JobSatisfaction'].mean())
print('Core talent, stayed: ',data_talent[data_talent['Attrition']==0]['JobSatisfaction'].mean()) # stayed minus left: 0.55
print('Non-core, left: ',data_normal[data_normal['Attrition']==1]['JobSatisfaction'].mean())
print('Non-core, stayed: ',data_normal[data_normal['Attrition']==0]['JobSatisfaction'].mean()) # stayed minus left: 0.28
Core talent, left:  2.3333333333333335
Core talent, stayed:  2.8877551020408165
Non-core, left:  2.4857142857142858
Non-core, stayed:  2.7691629955947135
print("Core talent mean job satisfaction: ", data_talent['JobSatisfaction'].mean())
print("Non-core mean job satisfaction: ", data_normal['JobSatisfaction'].mean())
Core talent mean job satisfaction:  2.768
Non-core mean job satisfaction:  2.724907063197026
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='RelationshipSatisfaction', data=data_talent, ax=ax[0])
sns.pointplot(x='Attrition', y='RelationshipSatisfaction', data=data_normal, ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='RelationshipSatisfaction'>
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='RelationshipSatisfaction', data=data_talent, hue='Gender', ax=ax[0])
sns.pointplot(x='Attrition', y='RelationshipSatisfaction', data=data_normal, hue='Gender', ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='RelationshipSatisfaction'>
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='WorkLifeBalance', data=data_talent, ax=ax[0])
sns.pointplot(x='Attrition', y='WorkLifeBalance', data=data_normal, ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='WorkLifeBalance'>
f, ax = plt.subplots(1,2, figsize=(10,5))
sns.pointplot(x='Attrition', y='WorkLifeBalance', data=data_talent, hue='Gender', ax=ax[0])
sns.pointplot(x='Attrition', y='WorkLifeBalance', data=data_normal, hue='Gender', ax=ax[1])
<AxesSubplot:xlabel='Attrition', ylabel='WorkLifeBalance'>
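In other words, work-life balance barely differs between MZ core talent who left and those who stayed, so it does not appear to drive their attrition. Other factors, such as pay and efficiency, seem to matter more to this group.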
f, ax = plt.subplots(1,2, figsize=(15,5))
# Exclude StockOptionLevel 3 from the plots only; the printed means use all levels
data_talent_s = data_talent[data_talent['StockOptionLevel']!=3]
data_normal_s = data_normal[data_normal['StockOptionLevel']!=3]
sns.pointplot(x='StockOptionLevel', y='Attrition', data=data_talent_s, ax=ax[0])
sns.pointplot(x='StockOptionLevel', y='Attrition', data=data_normal_s, ax=ax[1])
ax[0].set_title("StockOption & Attrition [MZ talent]",fontsize=13)
ax[1].set_title("StockOption & Attrition [others]",fontsize=13)
print("MZ talent, left, mean StockOptionLevel:",round(data_talent[data_talent['Attrition']==1]['StockOptionLevel'].mean(),4))
print("MZ talent, stayed, mean StockOptionLevel:",round(data_talent[data_talent['Attrition']==0]['StockOptionLevel'].mean(),4))
print("Others, left, mean StockOptionLevel:",round(data_normal[data_normal['Attrition']==1]['StockOptionLevel'].mean(),4))
print("Others, stayed, mean StockOptionLevel:",round(data_normal[data_normal['Attrition']==0]['StockOptionLevel'].mean(),4))
MZ talent, left, mean StockOptionLevel: 0.4444
MZ talent, stayed, mean StockOptionLevel: 0.8673
Others, left, mean StockOptionLevel: 0.5381
Others, stayed, mean StockOptionLevel: 0.8432
f, ax = plt.subplots(1,2, figsize=(15,5))
# Plot the full groups so the charts match the printed means
sns.pointplot(x='HomeDistance_range', y='Attrition', data=data_talent, ax=ax[0])
sns.pointplot(x='HomeDistance_range', y='Attrition', data=data_normal, ax=ax[1])
ax[0].set_title("HomeDistance & Attrition [MZ talent]",fontsize=13)
ax[1].set_title("HomeDistance & Attrition [others]",fontsize=13)
print("MZ talent, left, mean HomeDistance_range:",round(data_talent[data_talent['Attrition']==1]['HomeDistance_range'].mean(),4))
print("MZ talent, stayed, mean HomeDistance_range:",round(data_talent[data_talent['Attrition']==0]['HomeDistance_range'].mean(),4))
print("Others, left, mean HomeDistance_range:",round(data_normal[data_normal['Attrition']==1]['HomeDistance_range'].mean(),4))
print("Others, stayed, mean HomeDistance_range:",round(data_normal[data_normal['Attrition']==0]['HomeDistance_range'].mean(),4))
MZ talent, left, mean HomeDistance_range: 2.8148
MZ talent, stayed, mean HomeDistance_range: 2.398
Others, left, mean HomeDistance_range: 2.6143
Others, stayed, mean HomeDistance_range: 2.3885
f, ax = plt.subplots(1,2, figsize=(15,5))
# Plot the full groups so the charts match the printed means
sns.pointplot(x='Income_range', y='Attrition', data=data_talent, ax=ax[0])
sns.pointplot(x='Income_range', y='Attrition', data=data_normal, ax=ax[1])
ax[0].set_title("Monthly Income & Attrition [MZ talent]",fontsize=13)
ax[1].set_title("Monthly Income & Attrition [others]",fontsize=13)
print("MZ talent, left, mean Income_range:",round(data_talent[data_talent['Attrition']==1]['Income_range'].mean(),4))
print("MZ talent, stayed, mean Income_range:",round(data_talent[data_talent['Attrition']==0]['Income_range'].mean(),4))
print("Others, left, mean Income_range:",round(data_normal[data_normal['Attrition']==1]['Income_range'].mean(),4))
print("Others, stayed, mean Income_range:",round(data_normal[data_normal['Attrition']==0]['Income_range'].mean(),4))
MZ talent, left, mean Income_range: 1.4074
MZ talent, stayed, mean Income_range: 2.2347
Others, left, mean Income_range: 2.1095
Others, stayed, mean Income_range: 2.6203
# Encode travel frequency as an ordinal: Non-Travel=0, Travel_Rarely=1, Travel_Frequently=2
travel_codes = {"Non-Travel": 0, "Travel_Rarely": 1, "Travel_Frequently": 2}
data_talent["BusinessTravel"] = data_talent["BusinessTravel"].replace(travel_codes).astype(int)
data_normal["BusinessTravel"] = data_normal["BusinessTravel"].replace(travel_codes).astype(int)
f, ax = plt.subplots(1,2, figsize=(15,5))
# Plot the encoded full groups so the charts match the printed means
sns.pointplot(x='BusinessTravel', y='Attrition', data=data_talent, ax=ax[0])
sns.pointplot(x='BusinessTravel', y='Attrition', data=data_normal, ax=ax[1])
ax[0].set_title("BusinessTravel & Attrition [MZ talent]",fontsize=13)
ax[1].set_title("BusinessTravel & Attrition [others]",fontsize=13)
print("MZ talent, left, mean BusinessTravel:",round(data_talent[data_talent['Attrition']==1]['BusinessTravel'].mean(),4))
print("MZ talent, stayed, mean BusinessTravel:",round(data_talent[data_talent['Attrition']==0]['BusinessTravel'].mean(),4))
print("Others, left, mean BusinessTravel:",round(data_normal[data_normal['Attrition']==1]['BusinessTravel'].mean(),4))
print("Others, stayed, mean BusinessTravel:",round(data_normal[data_normal['Attrition']==0]['BusinessTravel'].mean(),4))
MZ talent, left, mean BusinessTravel: 1.2222
MZ talent, stayed, mean BusinessTravel: 1.0918
Others, left, mean BusinessTravel: 1.2429
Others, stayed, mean BusinessTravel: 1.0537
f, ax = plt.subplots(1,2, figsize=(15,5))
# Plot the full groups so the charts match the printed means
sns.pointplot(x='Attrition', y='WorkLifeBalance', data=data_talent, ax=ax[0])
sns.pointplot(x='Attrition', y='WorkLifeBalance', data=data_normal, ax=ax[1])
ax[0].set_title("WorkLifeBalance & Attrition [MZ talent]",fontsize=13)
ax[1].set_title("WorkLifeBalance & Attrition [others]",fontsize=13)
print("MZ talent, left, mean WorkLifeBalance:",round(data_talent[data_talent['Attrition']==1]['WorkLifeBalance'].mean(),4))
print("MZ talent, stayed, mean WorkLifeBalance:",round(data_talent[data_talent['Attrition']==0]['WorkLifeBalance'].mean(),4))
print("Others, left, mean WorkLifeBalance:",round(data_normal[data_normal['Attrition']==1]['WorkLifeBalance'].mean(),4))
print("Others, stayed, mean WorkLifeBalance:",round(data_normal[data_normal['Attrition']==0]['WorkLifeBalance'].mean(),4))
MZ talent, left, mean WorkLifeBalance: 2.7407
MZ talent, stayed, mean WorkLifeBalance: 2.7245
Others, left, mean WorkLifeBalance: 2.6476
Others, stayed, mean WorkLifeBalance: 2.7859
# Encode overtime as 0/1 so its mean is the share of employees working overtime
overtime_codes = {"Yes": 1, "No": 0}
data_talent["OverTime"] = data_talent["OverTime"].replace(overtime_codes).astype(int)
data_normal["OverTime"] = data_normal["OverTime"].replace(overtime_codes).astype(int)
f, ax = plt.subplots(1,2, figsize=(15,5))
# Plot the encoded full groups; the *_s subsets still hold the "Yes"/"No" strings
sns.pointplot(x='Attrition', y='OverTime', data=data_talent, ax=ax[0])
sns.pointplot(x='Attrition', y='OverTime', data=data_normal, ax=ax[1])
ax[0].set_title("OverTime & Attrition [MZ talent]",fontsize=13)
ax[1].set_title("OverTime & Attrition [others]",fontsize=13)
print("MZ talent, left, mean OverTime:",round(data_talent[data_talent['Attrition']==1]['OverTime'].mean(),4))
print("MZ talent, stayed, mean OverTime:",round(data_talent[data_talent['Attrition']==0]['OverTime'].mean(),4))
print("Others, left, mean OverTime:",round(data_normal[data_normal['Attrition']==1]['OverTime'].mean(),4))
print("Others, stayed, mean OverTime:",round(data_normal[data_normal['Attrition']==0]['OverTime'].mean(),4))
MZ talent, left, mean OverTime: 0.5556
MZ talent, stayed, mean OverTime: 0.1633
Others, left, mean OverTime: 0.5333
Others, stayed, mean OverTime: 0.2405
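To see at a glance which factors separate core-talent leavers from stayers, the printed means from the cells above can be collected and ranked by the size of the gap. The sketch below uses only the values printed in this notebook for the core-talent group. The ranking is rough, since the variables are on different scales (OverTime is 0/1, the others are ordinal bands), so it is a reading aid rather than a statistical result.

```python
# (mean for leavers, mean for stayers), copied from the printed outputs above
talent_means = {
    "StockOptionLevel": (0.4444, 0.8673),
    "HomeDistance_range": (2.8148, 2.398),
    "Income_range": (1.4074, 2.2347),
    "BusinessTravel": (1.2222, 1.0918),
    "WorkLifeBalance": (2.7407, 2.7245),
    "OverTime": (0.5556, 0.1633),
}

# Positive gap: leavers score higher than stayers on that variable
gaps = {name: round(left - stayed, 4) for name, (left, stayed) in talent_means.items()}

# Rank by absolute gap, largest first
ranked = sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)
for name, gap in ranked:
    print(f"{name:20s} {gap:+.4f}")
```

Standardizing each variable before ranking would make the comparison fairer; that step is left out here to keep the sketch short.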